We study the efficiency of V-fold cross-validation (VFCV) for model selection from the non-asymptotic viewpoint, and suggest an improvement on it, which we call "V-fold penalization".

Considering a particular (though simple) regression problem, we prove that VFCV with a bounded V is suboptimal for model selection, because it "overpenalizes", all the more so when V is small. Hence, asymptotic optimality requires V to go to infinity. However, when the signal-to-noise ratio is low, it appears that overpenalizing is necessary, so that the optimal V is not always the largest one, despite the variability issue. This is confirmed by some simulated data.

In order to improve on the prediction performance of VFCV, we define a new model selection procedure, called "V-fold penalization" (penVF). It is a V-fold subsampling version of Efron's bootstrap penalties, so that it has the same computational cost as VFCV, while being more flexible. In a heteroscedastic regression framework, assuming the models to have a particular structure, we prove that penVF satisfies a non-asymptotic oracle inequality with a leading constant that tends to 1 when the sample size goes to infinity. In particular, this implies adaptivity to the smoothness of the regression function, even with a highly heteroscedastic noise. Moreover, it is easy to overpenalize with penVF, independently from the V parameter. A simulation study shows that this results in a significant improvement on VFCV in non-asymptotic situations.

1. Introduction. There are typically two kinds of model selection criteria. On the one hand, penalized criteria are the sum of an empirical loss and some penalty term, often measuring the complexity of the models. This is the case of AIC (Akaike [Aka73]), Mallows' $C_p$ or $C_L$ (Mallows [Mal73]) and BIC (Schwarz [Sch78]), to name but a few. On the other hand, cross-validation (Allen [All74], Stone [Sto74], Geisser [Gei75]) and related criteria are based on the idea of data splitting. Part of the data (the training set) is used for fitting each model, and the rest of the data (the validation set) is used to measure the performance of the models. There are several versions of cross-validation (CV), e.g. leave-one-out (LOO, also called ordinary CV), leave-p-out (LPO, also called delete-p CV) and generalized CV (Craven and Wahba [CW79]). In practical applications, cross-validation is often computationally very expensive. This is why less greedy CV algorithms have been proposed, among which V-fold cross-validation (VFCV, Geisser [Gei75]) and repeated learning-testing methods (Breiman et al. [BFOS84]). In this article, we mainly consider VFCV, which seems to be the most widely used nowadays, when the goal of model selection is to be efficient, i.e. to minimize the prediction risk among a family of estimators. Let us emphasize that this is quite different from picking the "true model", which is often referred to as the identification or consistency issue.

The properties of CV (in particular leave-p-out) for prediction and model identification have been widely studied from the asymptotic viewpoint. The performance typically depends on the splitting ratio, i.e. the ratio between the sizes of the validation and training sets ($p/(n-p)$ in the leave-p-out case; $1/(V-1)$ for V-fold cross-validation). This has been shown for instance by Shao [Sha97] (for regression on linear models) and by van der Laan, Dudoit and Keles [vdLDK04] (for density estimation). Asymptotic optimality occurs when this ratio goes to zero as n goes to infinity, as shown by Li [Li87] for the leave-one-out, and generalized by Shao [Sha97] for the leave-p-out with $p \ll n$, both in the regression setting, when all the models are linear. Other asymptotic results about CV in regression can be found in the book by Györfi et al. [GKKW02], and in the paper of van der Laan, Dudoit and Keles [vdLDK04] for density estimation. Notice that the behaviour of these procedures changes completely when the goal is consistency; we refer to Yang [Yan07] and Sect. 5.4 below for references on this problem.

When it comes to practical application, a major question is how to choose the tuning parameters of CV procedures, since their performance strongly depends on them. In the case of VFCV, this means choosing V. Basically, there are three competing factors. First, the VFCV estimator of the prediction error, $\mathrm{crit}_{\mathrm{VFCV}}$, is biased, and its bias decreases with V. As shown by Burman [Bur89, Bur90], it is possible to correct this bias; otherwise, V should not be taken too small. Second, the variance of $\mathrm{crit}_{\mathrm{VFCV}}$ depends on V: it is always decreasing for small values of V, but then it can either keep decreasing (as in the linear regression case [Bur89]) or start to increase before V = n (as in some classification problems [Bre96, HTF01, MSP05] or in density estimation [CR08]; see Sect. 2.3). Third, the computational cost of VFCV is proportional to V, so that the theoretical optimum (taking only bias and variability into account) cannot always be computed. More precisely, it is necessary to understand well how the performance of VFCV depends on V before taking into account the computational cost. This is one of the purposes of this article.

We here aim at providing a better understanding of some CV procedures (including VFCV) from the non-asymptotic viewpoint. This may have two major implications. First, non-asymptotic results are made to handle collections of models which may depend on the sample size n: their sizes may typically be a power of n, and they may contain models whose complexities grow with n. Such collections of models are particularly significant for designing adaptive estimators of a function which is only assumed to belong to some Hölderian ball, which may require an arbitrarily large number of parameters. Second, in several practical applications, we are in a "non-asymptotic situation" in the sense that the signal-to-noise ratio is low. We shall see in the following that this should really be taken into account for an optimal tuning of V. It is worth noticing that such a non-asymptotic approach is not common in the literature, since most of the results already mentioned are asymptotic, and none of them considers our second point above.

Another important point in our approach is that our framework includes several kinds of heteroscedastic data. We only assume that the observations $(X_i, Y_i)_{1 \le i \le n}$ are i.i.d. with

$Y_i = s(X_i) + \sigma(X_i)\,\epsilon_i$ ,

where $s : \mathcal{X} \to \mathbb{R}$ is the (unknown) regression function, $\sigma : \mathcal{X} \to \mathbb{R}$ is the (unknown) noise-level, and $\epsilon_i$ has zero mean and unit variance conditionally on $X_i$. In particular, the noise-level $\sigma(X)$ can depend strongly on X, and the distribution of $\epsilon$ can itself depend on X. Such data are generally considered as very difficult to handle, because we have no information on $\sigma$, making irregularities of the signal harder to distinguish from noise. Then, simple model selection procedures such as Mallows' $C_p$ may not work (see Chap. 4 of [Arl07] for a theoretical argument), and it is natural to hope that VFCV or other resampling methods may be robust to heteroscedasticity. In this article, both theoretical and simulation results confirm this fact.

In Sect. 2, we provide a non-asymptotic analysis of the performance of VFCV. The aforementioned bias turns into a non-asymptotic negative result (Thm. 1), exhibiting a rather simple problem for which VFCV cannot satisfy an oracle inequality with leading constant smaller than $K(V) - \varepsilon_n$, with $K(V) > 1$ for any $V \ge 2$ and $\varepsilon_n \to 0$. In particular, VFCV with a bounded V cannot be asymptotically optimal. But our analysis also has a major positive consequence in some "non-asymptotic" situations. Indeed, by considering VFCV as a penalization procedure, our previous result can be interpreted as an overpenalization property of VFCV. This should be related to the fact that the efficiency of penalization methods (like Mallows' $C_p$) is often improved by overpenalization when the signal-to-noise ratio is small. Then, one can expect the optimal V for VFCV to be smaller than n, even for least-squares regression, which is confirmed by the simulation study of Sect. 4. So, it appears that choosing the optimal V for VFCV may be quite hard. In addition, the optimal choice may not be satisfactory when it corresponds to a highly variable criterion such as the 2-fold CV one. It is likely that there is some room left here to improve on VFCV.

This is why we propose in Sect. 3 another V-fold algorithm, that we call "V-fold penalization" (penVF). It is based upon Efron's resampling heuristics [Efr79], in the same way as Efron's bootstrap penalty [Efr83], but with a V-fold subsampling scheme instead of the bootstrap. It thus has exactly the same computational cost as the classical VFCV, and our results show that it has a similar robustness property, in some heteroscedastic regression framework. In addition, it turns out to be a generalization of Burman's corrected VFCV [Bur89, Bur90] (at least when the splitting into V blocks is regular). The main advantage of penVF is that it is straightforward to overpenalize by any factor when this is required, for instance when the signal-to-noise ratio seems low.

In the least-squares regression framework, when we have to select among histogram models (see Sect. 2.2 for a precise definition), we prove that penVF satisfies a non-asymptotic oracle inequality with a leading constant almost one (Thm. 2). To our knowledge, such a non-asymptotic result is new for any V-fold model selection procedure. One of its strengths is that it requires very few assumptions on the noise, allowing in particular heteroscedasticity. It is a strong result for penVF, which was not built for this particular setting at all, to improve on VFCV for such difficult problems, where VFCV is among the best procedures overall. As a consequence of Thm. 2, one can use penVF with the family of regular histograms in order to obtain an estimator adaptive to the smoothness of the regression function, when the noise is heteroscedastic (while having no information at all on the distribution of the noise). Notice that we only consider this result as a first step towards a more general theorem, without the restriction to histograms, as discussed in Sect. 5.3. The main interest of this toy framework is that we can study it deeply, and then derive general heuristics for practical use.

As an illustration of our theoretical study, we provide the results of a simulation study in Sect. 4. It confirms the good performance of penVF against both VFCV and the simpler Mallows' $C_p$ criterion, in particular for difficult heteroscedastic problems. We also show how useful the flexibility of penVF may be when the signal-to-noise ratio is low. Decoupling V from the overpenalization factor allows a significant improvement of the performance of both VFCV and its bias-corrected version.

Finally, our results are discussed in Sect. 5. The remainder of the paper is devoted to some probabilistic tools (App. A) and proofs (App. B).

2. Performance of V-fold cross-validation. In this section, we provide a non-asymptotic study of V-fold cross-validation (VFCV) in the least-squares regression framework. In order to make explicit computations possible, we focus on the case where each model is a "histogram model", i.e. the vector space of piecewise constant functions on some fixed partition of the feature space. This is only a first theoretical step. We use it to derive heuristics that should help the practical user of VFCV in any framework. Notice also that we do not assume that the regression function itself is piecewise constant.

2.1. General framework. First consider the general prediction setting: $\mathcal{X} \times \mathcal{Y}$ is a measurable space, P an unknown probability measure on it and we observe some data $(X_1, Y_1), \ldots, (X_n, Y_n) \in \mathcal{X} \times \mathcal{Y}$ of common law P. Let $\mathbb{S}$ be the set of predictors (measurable functions $\mathcal{X} \to \mathcal{Y}$) and $\gamma : \mathbb{S} \times (\mathcal{X} \times \mathcal{Y}) \to \mathbb{R}$ a contrast function. Given a family $(\hat{s}_m)_{m \in \mathcal{M}_n}$ of data-dependent predictors, our goal is to find the one minimizing the prediction loss $P\gamma(t) := \mathbb{E}_{(X,Y) \sim P}[\gamma(t, (X,Y))]$. Notice that the expectation here is only taken w.r.t. $(X, Y)$, so that $P\gamma(t)$ is random when t is random (e.g. data-driven). Assuming that there exists a minimizer $s \in \mathbb{S}$ of the loss (the Bayes predictor), we will often consider the excess loss $l(s,t) = P\gamma(t) - P\gamma(s) \ge 0$ instead of the loss.

We assume that each predictor $\hat{s}_m$ can be written as a function $\hat{s}_m(P_n)$ of the empirical distribution of the data $P_n = n^{-1} \sum_{i=1}^n \delta_{(X_i, Y_i)}$. The typical example of such a predictor is the empirical risk minimizer $\hat{s}_m \in \arg\min_{t \in S_m} \{P_n\gamma(t)\}$, where $S_m$ is any set of predictors (called a model). In the classical version of VFCV, we first choose some partition $(B_j)_{1 \le j \le V}$ of the indexes $\{1, \ldots, n\}$. Then, we define

$P_n^{(j)} = \frac{1}{\mathrm{Card}(B_j)} \sum_{i \in B_j} \delta_{(X_i, Y_i)}$ , $P_n^{(-j)} = \frac{1}{n - \mathrm{Card}(B_j)} \sum_{i \notin B_j} \delta_{(X_i, Y_i)}$ and $\hat{s}_m^{(-j)} = \hat{s}_m(P_n^{(-j)})$ .

The final VFCV estimator is $\hat{s}_{\hat{m}_{\mathrm{VFCV}}}(P_n)$ with

(1) $\hat{m}_{\mathrm{VFCV}} \in \arg\min_{m \in \mathcal{M}_n} \{\mathrm{crit}_{\mathrm{VFCV}}(m)\}$ and $\mathrm{crit}_{\mathrm{VFCV}}(m) := \frac{1}{V} \sum_{j=1}^{V} P_n^{(j)} \gamma(\hat{s}_m^{(-j)})$ .

It is classical to assume that the partition $(B_j)_{1 \le j \le V}$ is regular, i.e. that $\forall j$, $|\mathrm{Card}(B_j) - n/V| < 1$. In order to understand deeply the properties of VFCV, we have to compare precisely $\mathrm{crit}_{\mathrm{VFCV}}$ to the excess loss $l(s, \hat{s}_m)$. A crucial point is to compare their expectations, which is quite hard in general. This is why we restrict ourselves to a particular framework, namely the histogram regression one. We describe it in the next subsection.
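As a minimal illustration of definition (1), the following Python sketch computes $\mathrm{crit}_{\mathrm{VFCV}}(m)$ for regular histogram models on [0, 1] with the least-squares contrast. The function names, the Gaussian noise, the range of dimensions 1 to 20 and the convention that an empty cell predicts 0 are assumptions of this sketch, not choices made in the paper; the data only mimic a setting with n = 200, $\sigma = 1$ and $s(x) = \sin(\pi x)$.

```python
import numpy as np

def regular_blocks(n, V, rng):
    """Random partition of {0, ..., n-1} into V blocks whose sizes differ by at most 1."""
    return np.array_split(rng.permutation(n), V)

def fit_histogram(x, y, D):
    """Least-squares fit on the regular partition of [0, 1] into D cells (one mean per cell)."""
    cells = np.minimum((x * D).astype(int), D - 1)
    sums = np.bincount(cells, weights=y, minlength=D)
    counts = np.bincount(cells, minlength=D)
    # Assumed convention for this sketch: an empty cell predicts 0.
    return np.divide(sums, counts, out=np.zeros(D), where=counts > 0)

def predict_histogram(beta, x):
    """Evaluate the piecewise constant predictor with cell values beta at the points x."""
    D = len(beta)
    return beta[np.minimum((x * D).astype(int), D - 1)]

def crit_vfcv(x, y, D, blocks):
    """(1/V) * sum_j P_n^{(j)} gamma(s_hat_m^{(-j)}) for the least-squares contrast."""
    n = len(x)
    risks = []
    for block in blocks:
        train = np.ones(n, dtype=bool)
        train[block] = False                      # training set: all indexes outside B_j
        beta = fit_histogram(x[train], y[train], D)
        risks.append(np.mean((predict_histogram(beta, x[block]) - y[block]) ** 2))
    return float(np.mean(risks))

rng = np.random.default_rng(0)
n, V = 200, 5
x = rng.uniform(size=n)
y = np.sin(np.pi * x) + rng.normal(size=n)
blocks = regular_blocks(n, V, rng)
crit = {D: crit_vfcv(x, y, D, blocks) for D in range(1, 21)}
m_vfcv = min(crit, key=crit.get)                  # dimension selected by VFCV
```

Each model requires V fits, which is the computational cost discussed in the introduction.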

2.2. The histogram regression case. In the regression framework, the data $(X_i, Y_i) \in \mathcal{X} \times \mathbb{R}$ are i.i.d. of common law P. Denoting by s the regression function, we have

(2) $Y_i = s(X_i) + \sigma(X_i)\,\epsilon_i$

where $\sigma : \mathcal{X} \to \mathbb{R}$ is the heteroscedastic noise-level and the $\epsilon_i$ are i.i.d. centered noise terms, possibly dependent on $X_i$, but with mean 0 and variance 1 conditionally on $X_i$. In order to simplify the theory, we will make two main assumptions on the data throughout this paper:

$\sigma(X) \ge \sigma_{\min} > 0$ a.s. and $\|Y\|_\infty \le A < \infty$ .

Notice that we do not assume $\sigma_{\min}$ and A to be known by the statistician. Moreover, those two assumptions can be relaxed, as shown by Chap. 6 and Sect. 8.3 of [Arl07]. The feature space $\mathcal{X}$ is typically a compact subset of $\mathbb{R}^d$. We use the least-squares contrast $\gamma : (t, (x,y)) \mapsto (t(x) - y)^2$ to measure the quality of a predictor $t : \mathcal{X} \to \mathcal{Y}$. As a consequence, the Bayes predictor is the regression function s, and the excess loss is $l(s,t) = \mathbb{E}_{(X,Y) \sim P} (t(X) - s(X))^2$. To each model $S_m$, we associate the empirical risk minimizer

$\hat{s}_m = \hat{s}_m(P_n) = \arg\min_{t \in S_m} \{P_n\gamma(t)\}$

(when it exists and is unique). Define also $s_m := \arg\min_{t \in S_m} P\gamma(t)$.

We now focus on histograms. Each model in $(S_m)_{m \in \mathcal{M}_n}$ is the set of piecewise constant functions (histograms) on some partition $(I_\lambda)_{\lambda \in \Lambda_m}$ of $\mathcal{X}$. It is thus a vector space of dimension $D_m = \mathrm{Card}(\Lambda_m)$, spanned by the family $(\mathbf{1}_{I_\lambda})_{\lambda \in \Lambda_m}$. As this basis is orthogonal in $L^2(\mu)$ for any probability measure $\mu$ on $\mathcal{X}$, we can make explicit computations. The following notations will be useful throughout this article.

$p_\lambda := P(X \in I_\lambda)$ , $\hat{p}_\lambda := P_n(X \in I_\lambda)$ , $\sigma_\lambda^2 := \mathbb{E}[(Y - s(X))^2 \mid X \in I_\lambda]$ .

Remark that $\hat{s}_m$ is uniquely defined if and only if each $I_\lambda$ contains at least one of the $X_i$, i.e. $\min_{\lambda \in \Lambda_m} \{\hat{p}_\lambda\} > 0$. Prop. 1 below compares the V-fold criterion and the ideal criterion $P\gamma(\hat{s}_m)$ in expectation.



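Proposition 1. Let $S_m$ be the model of histograms associated with the partition $(I_\lambda)_{\lambda \in \Lambda_m}$ and $(B_j)_{1 \le j \le V}$ some "almost regular" partition of $\{1, \ldots, n\}$, i.e. such that

$\sup_j \frac{\mathrm{Card}(B_j)}{n} \le c_B < 1$ and $\max_j \left| \frac{\mathrm{Card}(B_j)}{n} - \frac{1}{V} \right| \le$ […] .

Then, the expectations of the ideal and V-fold criteria are respectively equal to

(3) $\mathbb{E}[P\gamma(\hat{s}_m)] = P\gamma(s_m) + \frac{1}{n} \sum_{\lambda \in \Lambda_m} (1 + \delta_{n,p_\lambda})\, \sigma_\lambda^2$

(4) $\mathbb{E}[\mathrm{crit}_{\mathrm{VFCV}}(m)] = P\gamma(s_m) + \frac{V}{V-1} \cdot \frac{1}{n} \sum_{\lambda \in \Lambda_m} (1 + \delta^{(VF)}_{n,p_\lambda})\, \sigma_\lambda^2$ ,

where $\delta_{n,p}$ only depends on $(n,p)$, $\delta^{(VF)}_{n,p}$ depends on $(n,p)$ and the partition $(B_j)_{1 \le j \le V}$, but both are small when the product np is large:

$|\delta_{n,p}| \le L_1$ […] and $|\delta^{(VF)}_{n,p}| \le L_2 \left( […] + \max\left\{ (np)^{-1/4},\, e^{-np(1-c_B)} \right\} \right)$ ,

where $L_1$ is a numerical constant and $L_2$ only depends on $c_B$.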



Remark 1. Since we deal with histograms, $\hat{s}_m$ is not defined when $\min_{\lambda \in \Lambda_m} \hat{p}_\lambda = 0$, which occurs with positive probability. We then have to take a convention for $P\gamma(\hat{s}_m)$ (on the event $\min_{\lambda \in \Lambda_m} \{\hat{p}_\lambda\} = 0$, which generally has a very small probability) so that it has a finite expectation. The same kind of problem occurs with $\mathrm{crit}_{\mathrm{VFCV}}$. See the proof of Prop. 1.



Prop. 1 is consistent with Burman's asymptotic estimate of the bias of VFCV [Bur89]. The major advance here is that it is non-asymptotic, and we have explicit upper bounds on the remainder terms (see the proof of Prop. 1 in App. B). It shows that the classical V-fold cross-validation overestimates the variance term $n^{-1} \sum_{\lambda \in \Lambda_m} \sigma_\lambda^2$, because it estimates the generalization ability of $\hat{s}_m^{(-j)}$, which is built upon less data than $\hat{s}_m$. This interpretation is consistent with the results of Shao [Sha97] on linear regression, and van der Laan, Dudoit and Keles [vdLDK04] in the density estimation framework.

When V stays bounded as n grows to infinity, it is then natural to think that VFCV is underfitting, and thus suboptimal for prediction. Since Prop. 1 is non-asymptotic and quite accurate, we are now in a position to prove such a result.

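Theorem 1. Let $n \in \mathbb{N}$, $(X_i, Y_i)_{1 \le i \le n}$ be i.i.d. random variables, with $X \sim \mathcal{U}([0,1])$ and $Y = X + \sigma\epsilon$ with $\sigma > 0$, $\mathbb{E}[\epsilon \mid X] = 0$, $\mathbb{E}[\epsilon^2 \mid X] = 1$ and $\|\epsilon\|_\infty < +\infty$. Let $\mathcal{M}_n = \{1, \ldots, n\}$ and $\forall m \in \mathcal{M}_n$, $S_m$ be the model of regular histograms with $D_m = m$ pieces on $\mathcal{X} = [0,1]$. Let $V \in \{2, \ldots\}$ and $(B_j)_{1 \le j \le V}$ be some partition of $\{1, \ldots, n\}$ such that for every j,

$|\mathrm{Card}(B_j) - nV^{-1}| < 1$ .

Then, there is an event of probability at least $1 - K_1 n^{-2}$ on which

(5) $l(s, \hat{s}_{\hat{m}_{\mathrm{VFCV}}}) \ge \left(1 + \kappa(V) - \ln(n)^{-1/5}\right) \inf_{m \in \mathcal{M}_n} \{l(s, \hat{s}_m)\}$ ,

for some constant $\kappa(V) > 0$ depending only on V (and decreasing as a function of V), and a constant $K_1$ which depends on $\sigma$, A and V.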

We now make a few comments:

• In the same framework, using similar arguments, we can prove an upper bound on $l(s, \hat{s}_{\hat{m}_{\mathrm{VFCV}}})$ showing that the constant $1 + \kappa(V)$ is exact (up to the $\ln(n)^{-1/5}$ term). In particular,

$1 + \kappa(V) = 1 +$ […] $> 1$ .

• When $(B_j)_{1 \le j \le V}$ is not assumed regular, the proof of Prop. 1 shows that the factor $V/(V-1)$ becomes $V^{-1} \sum_{j=1}^{V} n/(n - \mathrm{Card}(B_j))$, which is always larger, because $x \mapsto (n-x)^{-1}$ is convex. On the other hand, if one chooses a $(X_i)_{1 \le i \le n}$-dependent partition such that for every $\lambda \in \Lambda_m$, $\mathrm{Card}\{i \text{ s.t. } X_i \in I_\lambda \text{ and } i \in B_j\}$ is (almost) independent of j, then a similar proof shows that $\delta^{(VF)}_{n,p_\lambda}$ is made much smaller than the previous upper bound. In a nutshell, it seems that the best performance of VFCV corresponds in general to the regular partition case, for which (5) holds.

• Although we restrict ourselves in Thm. 1 to a very particular problem, a similar result remains valid much more generally, possibly with a different value for the constant $\kappa(V)$. The only purpose of our assumptions is to compare very precisely $\mathrm{crit}_{\mathrm{VFCV}}(m)$ and $P\gamma(\hat{s}_m)$ as functions of m. Since $D_{\hat{m}_{\mathrm{VFCV}}}$ is smaller than the optimum only by a multiplicative factor independent of n, this analysis strongly depends on how $P\gamma(\hat{s}_m)$ varies with m.

• One can easily extend this result to any cross-validation-like method, when two conditions are satisfied. First, the ratio between the size of the training set and n has to be upper-bounded by a constant strictly smaller than 1 (uniformly in n). Second, the number of training sets considered has to be bounded by $B_{\max}$ (on which $K_1$ may depend). This includes for instance the hold-out case, and repeated learning-testing methods. Notice that the second assumption is mainly technical; if we were able to prove the corresponding concentration inequalities, the leave-p-out with $p \sim n/V$ should have approximately the same properties.
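The convexity argument used above for non-regular partitions can be checked numerically. The short Python sketch below evaluates the averaged factor $V^{-1} \sum_j n/(n - \mathrm{Card}(B_j))$ for two sets of block sizes; the function name and the block sizes are hypothetical, and the $V^{-1}$ normalization is the one under which a regular partition gives exactly $V/(V-1)$.

```python
def vfcv_variance_factor(block_sizes):
    """(1/V) * sum_j n / (n - Card(B_j)), with n the total number of observations."""
    n = sum(block_sizes)
    V = len(block_sizes)
    return sum(n / (n - b) for b in block_sizes) / V

# Regular partition of n = 100 points into V = 5 blocks: factor V / (V - 1).
regular = vfcv_variance_factor([20, 20, 20, 20, 20])

# Same n and V, but unbalanced blocks: by Jensen's inequality the factor can only grow.
unbalanced = vfcv_variance_factor([60, 10, 10, 10, 10])
```

Under these assumptions the regular partition attains the smallest factor, in line with the remark that the regular case corresponds to the best performance of VFCV.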

2.3. How to choose V.

2.3.1. Classical analysis. There are three well-known factors to take into account in order to choose V:

• bias: when V is too small, $\mathrm{crit}_{\mathrm{VFCV}}$ overestimates the variance term in $P\gamma(\hat{s}_m)$, which leads to underfitting and suboptimal model selection (Thm. 1).

• variability: the variance of $\mathrm{crit}_{\mathrm{VFCV}}(m)$ is a decreasing function of V, at least in the linear regression framework (see Burman [Bur89] for an asymptotic expansion of this variance). In general, V = 2 is known to be quite variable because of the single split. When the prediction algorithm $(X_i, Y_i)_{1 \le i \le n} \mapsto \hat{s}_m$ is unstable (e.g. classification with CART, as noticed by Hastie, Tibshirani and Friedman [HTF01]; see also Breiman [Bre96]), the leave-one-out criterion (i.e. V = n) is also known to be quite variable, but this phenomenon seems to disappear when $\hat{s}_m$ is more stable (Molinaro, Simon and Pfeiffer [MSP05]). In particular, in the least-squares regression framework, the variance of $\mathrm{crit}_{\mathrm{VFCV}}(m)$ should decrease with V.

• computational complexity: V-fold cross-validation needs to compute at least V empirical risk minimizers for each model.

In the least-squares regression setting, V has to be chosen large in order to improve accuracy (by reducing bias and variability); on the contrary, computational issues arise when V is too big. This is why V = 5 and V = 10 are very classical and popular choices.

2.3.2. The non-asymptotic need for overpenalization. We now come to a particularity of the non-asymptotic viewpoint. Indeed, our proof of Thm. 1 shows that the asymptotic behaviour of hold-out and cross-validation criteria only depends on their bias, because all these criteria are sufficiently close to their expectations asymptotically. However, this is not true when the sample size is fixed, and even the least variable criteria are far from being deterministic. As a consequence, using an unbiased estimator is no longer a guarantee of being optimal, since it can still lead to choosing a very poor model with a positive probability.

In order to analyze this phenomenon, it is useful to take the penalization viewpoint. The idea of penalization for model selection is to define

(6) $\hat{m} \in \arg\min_{m \in \mathcal{M}_n} \{ P_n\gamma(\hat{s}_m) + \mathrm{pen}(m) \}$

where pen(m) is chosen so that $P_n\gamma(\hat{s}_m) + \mathrm{pen}(m)$ is close to the prediction error $P\gamma(\hat{s}_m)$. In other words, the "ideal penalty" is

(7) $\mathrm{pen}_{\mathrm{id}}(m) := (P - P_n)\gamma(\hat{s}_m)$ .

Fig 1. The non-asymptotic need for overpenalization: the prediction performance $C_{\mathrm{or}}$ (defined in Sect. 4) of the model selection procedure (6) with $\mathrm{pen}(m) = C_{\mathrm{ov}}\, \mathbb{E}[\mathrm{pen}_{\mathrm{id}}(m)]$ is represented as a function of the overpenalization constant $C_{\mathrm{ov}}$. Data and models are the ones of experiment (S1): n = 200, $\sigma = 1$, $s(x) = \sin(\pi x)$. See Sect. 4 for more details.

According to Prop. 1 and (38) (which follows its proof), in the histogram regression case, we can compute the expectation of the ideal penalty:

(8) $\mathbb{E}[\mathrm{pen}_{\mathrm{id}}(m)] = \frac{1}{n} \sum_{\lambda \in \Lambda_m} (2 + \delta_{n,p_\lambda})\, \sigma_\lambda^2$ ,

which is close to Mallows' $C_p$ penalty $2\sigma^2 D_m n^{-1}$ in the homoscedastic case. The point is that overpenalization (that is, taking pen larger than $\mathrm{pen}_{\mathrm{id}}$, even in expectation) can improve the prediction performance of (6) when the signal-to-noise ratio is small. This can be seen on Fig. 1, according to which the optimal overpenalization constant $C^*_{\mathrm{ov}}$ seems to be between 1.2 and 1.7 for this particular model selection problem. See also [Arl07] for a longer discussion of this problem.
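To make the role of the overpenalization constant concrete, here is a small Python sketch of the penalized selection rule: it minimizes the empirical risk plus $C_{\mathrm{ov}}$ times a penalty. The function name `select_by_penalization` and all numbers are hypothetical and chosen only for illustration (a Mallows-type penalty $2\sigma^2 D_m/n$ with $\sigma^2 = 1$ and n = 100); they are not taken from the simulation study.

```python
def select_by_penalization(emp_risk, pen, c_ov=1.0):
    """Return the index m minimizing emp_risk[m] + c_ov * pen[m]."""
    crit = [r + c_ov * p for r, p in zip(emp_risk, pen)]
    return min(range(len(crit)), key=crit.__getitem__)

# Hypothetical values, for illustration only: empirical risks of nested models
# of dimensions 1..5, and the penalty 2 * sigma^2 * D_m / n with sigma^2 = 1, n = 100.
dims = [1, 2, 3, 4, 5]
emp_risk = [1.30, 1.10, 1.03, 1.00, 0.99]
pen = [2 * 1.0 * d / 100 for d in dims]

d_unbiased = dims[select_by_penalization(emp_risk, pen, c_ov=1.0)]  # no overpenalization
d_overpen = dims[select_by_penalization(emp_risk, pen, c_ov=2.0)]   # penalty doubled
```

With these toy values, doubling the penalty moves the selected dimension from 4 to 3: overpenalizing favours smaller models, which is what protects against the variability of the criterion when the signal-to-noise ratio is low.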

2.3.3. Choosing V in the non-asymptotic framework. Since V-fold cross-validation chooses the model $\hat{m}_{\mathrm{VFCV}}$ which minimizes some criterion $\mathrm{crit}_{\mathrm{VFCV}}$, it can be written as a penalization procedure: it satisfies (6) with

$\mathrm{pen}_{\mathrm{VFCV}}(m) := \mathrm{crit}_{\mathrm{VFCV}}(m) - P_n\gamma(\hat{s}_m)$ .

Using again Prop. 1 and (38), we can compute its expectation:

$\mathbb{E}[\mathrm{pen}_{\mathrm{VFCV}}(m)] = \frac{1}{n} \sum_{\lambda \in \Lambda_m} \left( 1 + \frac{V}{V-1} \left( 1 + \delta^{(VF)}_{n,p_\lambda} \right) \right) \sigma_\lambda^2$ .

Compared to (8), this shows that V-fold cross-validation is overpenalizing within a factor $1 + 1/(2(V-1))$.
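The factor above follows from comparing the leading terms of the two expected penalties: $(1 + V/(V-1))/2 = 1 + 1/(2(V-1))$. The following Python sketch, in which the function name is an assumption and the remainder terms $\delta$ are neglected, tabulates this factor exactly for a few values of V.

```python
from fractions import Fraction

def vfcv_overpenalization(V):
    """First-order ratio E[pen_VFCV] / E[pen_id] = (1 + V / (V - 1)) / 2."""
    V = Fraction(V)
    return (1 + V / (V - 1)) / 2

# Exact values for some classical choices of V.
factors = {V: vfcv_overpenalization(V) for V in (2, 5, 10, 20)}
```

For instance, V = 2 gives 3/2 and V = 10 gives 19/18, so that only small values of V overpenalize substantially.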

We can now revisit the question of choosing V for optimal prediction, in such a non-asymptotic situation:

• the overpenalization factor is $1 + 1/(2(V-1))$.

• the variance of $\mathrm{crit}_{\mathrm{VFCV}}$ roughly decreases with V.

• the computational complexity of computing $\mathrm{crit}_{\mathrm{VFCV}}$ is roughly proportional to V.

First, take only the prediction performance into account. The variability question should be less crucial than overpenalization, because the variance of $\mathrm{crit}_{\mathrm{VFCV}}$ depends on V only through second order terms, according to the asymptotic computations of Burman [Bur89]. Since the optimal overpenalization constant is $C^*_{\mathrm{ov}} > 1$, the performance of V-fold cross-validation should be optimal for some $V^* < n$. This analysis is confirmed by the simulation study of Sect. 4, where V = 2 provides better performance than V = 5 and V = 10 for several different experiments.

Now, if computational cost comes into the balance, or if we consider less stable prediction algorithms than least-squares regression estimators, the optimal V may be even smaller. Whatever the framework, it seems quite difficult to find the optimal V, even if $C^*_{\mathrm{ov}}$ were known (which is far from being the case in general). It would be at least necessary to understand well how the variance of $\mathrm{crit}_{\mathrm{VFCV}}$ depends on V in the non-asymptotic framework. This is a difficult practical problem, since "there is no universal (valid under all distributions) unbiased estimator of the variance of V-fold cross-validation" (Bengio and Grandvalet [BG04]). In the density estimation framework, this question has been tackled recently by Celisse and Robin [CR08].

The conclusion of this section is that choosing V for V-fold cross-validation is a very complex issue in practice, even independently of the cost of computing $\mathrm{crit}_{\mathrm{VFCV}}$. Moreover, it seems unsatisfactory to select a model according to a criterion as variable as the 2-fold cross-validation one when $V^* = 2$ because of the need for overpenalization. Finally, when the signal-to-noise ratio is large, we would like to obtain a nearly unbiased procedure without having to take V very large, which can be computationally too heavy.

In other words, we would like to decouple the choice of an overpenalization factor from the variability issue (which is essentially linked with complexity). The drawback of V-fold cross-validation is that they both depend on the V parameter. As we shall see in the next section, such a decoupling can be naturally obtained through the use of penalization.

3. An alternative V-fold algorithm: V-fold penalties. There are several ways to define V-fold cross-validation-like penalization procedures with a tunable overpenalization factor, independent of the V parameter. A first idea may be to multiply $\mathrm{pen}_{\mathrm{VFCV}}(m)$ by a constant, i.e. to use (6) with the penalty

$\mathrm{pen}(m) = C_{\mathrm{ov}} \left( 1 + \frac{1}{2(V-1)} \right)^{-1} \left( \mathrm{crit}_{\mathrm{VFCV}}(m) - P_n\gamma(\hat{s}_m) \right)$ .

From the proof of Thm. 1 (see also the one of Thm. 2 below), it is clear that when $C_{\mathrm{ov}} \sim 1$, this procedure satisfies with large probability a non-asymptotic oracle inequality with leading constant $1 + \varepsilon_n$, and more generally an oracle inequality with leading constant $K(C_{\mathrm{ov}}) > 1$. However, this may seem a little artificial, and strongly dependent on the histogram regression framework in which the computations of Prop. 1 work.

In this section, we consider another approach, that we call "V-fold penalization", which seems more natural to us. We shall see below that it is closely related to an idea of Burman [Bur89, Bur90] for correcting the bias of V-fold cross-validation. However, Burman did not consider his method as a penalization one. His goal was only to obtain an unbiased estimate of the prediction error, so that it is not straightforward to choose an overpenalization factor different from 1 with his method. This is a major difference with our approach.

3.1. Definition of V-fold penalties.

3.1.1. General framework. We come back to the general setting of Sect. 2.1. Recall that each predictor $\hat{s}_m$ can be written as a function $\hat{s}_m(P_n)$ of the empirical distribution of the data $P_n = n^{-1} \sum_{i=1}^n \delta_{(X_i, Y_i)}$. We want to build a penalization method, i.e. choose $\hat{m}$ according to (6), so that the prediction error of $\hat{s}_{\hat{m}}$ is as small as possible. This could be done exactly if we knew the ideal penalty $\mathrm{pen}_{\mathrm{id}}(m) = (P - P_n)\gamma(\hat{s}_m(P_n))$, but this quantity depends on the unknown distribution P. Following a heuristics due to Efron [Efr79], we propose to define pen as the resampling estimate of $\mathrm{pen}_{\mathrm{id}}$, according to a V-fold subsampling scheme. We first recall the general form of this heuristics.

Basically, the resampling heuristics tells that one can mimic the relationship between P and $P_n$ by building an n-sample of common distribution $P_n$ (the "resample"). $P_n^W$ denoting the empirical distribution of the resample, the pair $(P, P_n)$ should be close (in distribution) to the pair $(P_n, P_n^W)$ (conditionally on $P_n$ for the latter distribution). Then, the expectation of any quantity of the form $F(P, P_n)$ can be estimated by $\mathbb{E}_W[F(P_n, P_n^W)]$, where $\mathbb{E}_W[\cdot]$ denotes expectation w.r.t. the resampling randomness. In the case of $\mathrm{pen}_{\mathrm{id}}$, this leads to Efron's bootstrap penalty [Efr83]. Later on, this heuristics has been generalized to other resampling schemes, with the exchangeable weighted bootstrap (Mason and Newton [MN92], Præstgaard and Wellner [PW93]). The empirical distribution of the resample then has the general form

$P_n^W := \frac{1}{n} \sum_{i=1}^n W_i\, \delta_{(X_i, Y_i)}$ with W an exchangeable weight vector,

independent of the data (W is said to be exchangeable when its distribution is invariant by any permutation of its coordinates). Fromont [Fro07] used it successfully (with a particular upper bound on $\mathrm{pen}_{\mathrm{id}}$) to build global penalties in the classification framework. Exchangeable resampling penalties (generalizing Efron's bootstrap penalty) have also been recently proposed, and studied in the regression framework [Arl07]. The idea of V-fold penalties is to use a V-fold subsampling scheme instead, i.e. take $W_i = \frac{V}{V-1} \mathbf{1}_{i \notin B_J}$ with $J \sim \mathcal{U}(\{1, \ldots, V\})$ independent of the data ($\mathcal{U}(E)$ denotes the uniform distribution over the set E). Then, $P_n^W = P_n^{(-J)}$ and we obtain the following algorithm.

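Algorithm 1 (V-fold penalization).

1. Choose a partition $(B_j)_{1 \le j \le V}$ of $\{1, \ldots, n\}$, as regular as possible.

2. Choose a constant $C \ge C_{W,\infty} = V - 1$.

3. Compute the following resampling penalty for each $m \in \mathcal{M}_n$:

$\mathrm{pen}(m) = \mathrm{pen}_{\mathrm{VF}}(m) := \frac{C}{V} \sum_{j=1}^{V} \left[ P_n\gamma\left(\hat{s}_m(P_n^{(-j)})\right) - P_n^{(-j)}\gamma\left(\hat{s}_m(P_n^{(-j)})\right) \right]$ .

4. Choose $\hat{m}$ according to (6).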
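The four steps of Algorithm 1 can be sketched in Python as follows, for regular histogram models on [0, 1] with the least-squares contrast. This is only an illustrative sketch: the function names, the Gaussian noise, the range of dimensions, the factor 1.5 used for overpenalization and the convention that an empty cell predicts 0 are assumptions made here, not prescriptions of the algorithm.

```python
import numpy as np

def fit_histogram(x, y, D):
    """Least-squares fit on the regular partition of [0, 1] into D cells."""
    cells = np.minimum((x * D).astype(int), D - 1)
    sums = np.bincount(cells, weights=y, minlength=D)
    counts = np.bincount(cells, minlength=D)
    # Assumed convention for this sketch: an empty cell predicts 0.
    return np.divide(sums, counts, out=np.zeros(D), where=counts > 0)

def emp_risk(beta, x, y):
    """Empirical least-squares risk of the histogram with cell values beta on (x, y)."""
    cells = np.minimum((x * len(beta)).astype(int), len(beta) - 1)
    return float(np.mean((beta[cells] - y) ** 2))

def pen_vf(x, y, D, blocks, C):
    """Step 3: (C / V) * sum_j [P_n gamma(s^{(-j)}) - P_n^{(-j)} gamma(s^{(-j)})]."""
    n = len(x)
    total = 0.0
    for block in blocks:
        train = np.ones(n, dtype=bool)
        train[block] = False
        beta = fit_histogram(x[train], y[train], D)        # s_hat_m(P_n^{(-j)})
        total += emp_risk(beta, x, y) - emp_risk(beta, x[train], y[train])
    return C * total / len(blocks)

def select_pen_vf(x, y, dims, blocks, C):
    """Step 4: minimize the penalized empirical risk over the candidate dimensions."""
    crit = {D: emp_risk(fit_histogram(x, y, D), x, y) + pen_vf(x, y, D, blocks, C)
            for D in dims}
    return min(crit, key=crit.get)

rng = np.random.default_rng(0)
n, V = 200, 5
x = rng.uniform(size=n)
y = np.sin(np.pi * x) + rng.normal(size=n)
blocks = np.array_split(rng.permutation(n), V)              # step 1: regular partition
m_unbiased = select_pen_vf(x, y, range(1, 21), blocks, C=V - 1)          # step 2: C = V - 1
m_overpen = select_pen_vf(x, y, range(1, 21), blocks, C=1.5 * (V - 1))   # overpenalized
```

The V fits per model are the same as those needed by VFCV, and changing C requires no further fitting, which is why the overpenalization factor can be tuned independently of V.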



Remark 2 (About the constant C). Contrary to Efron's resampling heuristics, we have to put a constant $C \ne 1$ in front of the penalty (pen being an unbiased estimator of $\mathrm{pen}_{\mathrm{id}}$ when $C = C_{W,\infty}$). This is because each $W_i$ has a variance $(V-1)^{-1} \ne 1$ (we only normalized W so that $\mathbb{E}[W_i] = 1$ for every i). According to Lemma 8.4 of [Arl07], the right normalizing constant can be derived from the exchangeable case. As a consequence, from Theorem 3.6.13 in [vdVW96],

$C_{W,\infty} \sim_{n \to \infty} \left( n^{-1} \sum_{i=1}^{n} \mathbb{E}(W_i - 1)^2 \right)^{-1}$ .
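For the V-fold weights $W_i = \frac{V}{V-1} \mathbf{1}_{i \notin B_J}$ with J uniform over $\{1, \ldots, V\}$, the moments appearing in this formula can be worked out directly. The derivation below is a short check under that definition; it only uses the fact that each index i belongs to exactly one block.

```latex
\begin{align*}
\mathbb{P}\Big(W_i = \tfrac{V}{V-1}\Big) &= \mathbb{P}(i \notin B_J) = \frac{V-1}{V},
\qquad \mathbb{P}(W_i = 0) = \frac{1}{V}, \\
\mathbb{E}[W_i] &= \frac{V}{V-1} \cdot \frac{V-1}{V} = 1, \\
\mathbb{E}\big[(W_i - 1)^2\big] &= \mathbb{E}[W_i^2] - 1
  = \frac{V^2}{(V-1)^2} \cdot \frac{V-1}{V} - 1 = \frac{1}{V-1}, \\
\Big( \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\big[(W_i - 1)^2\big] \Big)^{-1} &= V - 1 .
\end{align*}
```

This recovers the variance $(V-1)^{-1}$ and the value $C_{W,\infty} = V - 1$ used in step 2 of Algorithm 1.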

The asymptotic value of $C_{W,\infty}$ can also be derived from the computations of Burman [Bur89] in the linear regression framework. Indeed, with our notations, Burman's criterion (formula (2.3) in [Bur89]) is

$\mathrm{crit}_{\mathrm{corr.VF}}(m) := \mathrm{crit}_{\mathrm{VFCV}}(m) + P_n\gamma(\hat{s}_m) - \frac{1}{V} \sum_{j=1}^{V} P_n\gamma(\hat{s}_m^{(-j)})$
$= P_n\gamma(\hat{s}_m) + \frac{1}{V} \sum_{j=1}^{V} (P_n^{(j)} - P_n)\gamma(\hat{s}_m^{(-j)})$ .

If all the blocks of the partition have the same size n/V, then $P_n^{(j)} - P_n = (V-1)(P_n - P_n^{(-j)})$, so that Burman's corrected VFCV coincides exactly with V-fold penalization when $C = V - 1$. Since $\mathrm{crit}_{\mathrm{corr.VF}}(m)$ is an asymptotically unbiased estimator of $P\gamma(\hat{s}_m)$ (at least for linear regression), the result follows. From the non-asymptotic viewpoint, we prove in Sect. 3.2 below that $C = V - 1$ also leads to an unbiased estimator of $\mathrm{pen}_{\mathrm{id}}$ in the histogram regression case.
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Notice also that we do not assume that C = Cw,oo, but only C > Cw,oo- This is a major 
quality of y-fold penalization (penVF): it is straightforward to choose any overpenalization 
factor, independently from V . Further comments about the choice of C and V are made in 
Sect. U 
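The normalization discussed in Remark 2 can be checked with exact arithmetic. The sketch below assumes only that the V-fold weights are $W_i = \frac{V}{V-1}\mathbf{1}_{i \notin B_J}$ with $J$ uniform on the blocks, as in the text; the function name `vfold_weight_moments` is hypothetical. Since each index belongs to exactly one block, $P(i \in B_J) = 1/V$ whatever the block sizes.

```python
from fractions import Fraction

def vfold_weight_moments(V):
    """Exact mean and variance of W_i = V/(V-1) * 1{i not in B_J}, J uniform on V blocks."""
    p_keep = Fraction(V - 1, V)       # probability that i is not in the left-out block
    w = Fraction(V, V - 1)            # value of the weight when i is kept
    mean = w * p_keep
    var = w ** 2 * p_keep - mean ** 2
    return mean, var

for V in (2, 5, 10, 20):
    mean, var = vfold_weight_moments(V)
    # The inverse variance is the constant C_{W,infinity} = V - 1 of the remark.
    print(V, mean, var, 1 / var)
```

The inverse of the variance equals V - 1 for every V, in line with the asymptotic formula for $C_{W,\infty}$ above.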

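3.1.2. The histogram regression case. We now come back to the framework of Sect. 2.2, in
which we can analyze Algorithm 1 more deeply. Recall that histograms are not our final goal, but
only a convenient setting from which we can derive heuristics for practical use of penVF in any
framework. From now on, $(S_m)_{m \in \mathcal{M}_n}$ is a collection of histogram models and $(\hat s_m)_{m \in \mathcal{M}_n}$ the
associated collection of least-squares estimators. We first introduce some more notations:

$$s_m = \sum_{\lambda \in \Lambda_m} \beta_\lambda \mathbf{1}_{I_\lambda} \quad \text{and} \quad \hat s_m = \sum_{\lambda \in \Lambda_m} \hat\beta_\lambda \mathbf{1}_{I_\lambda} \quad \text{with} \quad \beta_\lambda = E[Y \mid X \in I_\lambda] \quad \text{and} \quad \hat\beta_\lambda = \frac{1}{n \hat p_\lambda} \sum_{X_i \in I_\lambda} Y_i ,$$

$$\hat p_\lambda^W := P_n^W(X \in I_\lambda) = \hat p_\lambda W_\lambda \quad \text{with} \quad W_\lambda := \frac{1}{n \hat p_\lambda} \sum_{X_i \in I_\lambda} W_i ,$$

$$\text{and} \quad \hat s_m^W := \arg\min_{t \in S_m} P_n^W \gamma(t) = \sum_{\lambda \in \Lambda_m} \hat\beta_\lambda^W \mathbf{1}_{I_\lambda} \quad \text{with} \quad \hat\beta_\lambda^W := \frac{1}{n \hat p_\lambda^W} \sum_{X_i \in I_\lambda} W_i Y_i .$$

Assuming that $\min_{\lambda \in \Lambda_m} \hat p_\lambda > 0$ (otherwise, the model $m$ should clearly not be chosen), we
can compute the ideal penalty (see (37) and (38) in Sect. B.4) and its resampling estimate:

$$\mathrm{pen}_{\mathrm{id}}(m) = (P - P_n) \gamma(\hat s_m) = \sum_{\lambda \in \Lambda_m} (p_\lambda + \hat p_\lambda) \big(\hat\beta_\lambda - \beta_\lambda\big)^2 + (P - P_n) \gamma(s_m)$$

$$(9) \qquad E_W\left[ (P_n - P_n^W) \gamma\big(\hat s_m^W\big) \right] = \sum_{\lambda \in \Lambda_m} E_W\left[ \big(\hat p_\lambda + \hat p_\lambda^W\big) \big(\hat\beta_\lambda^W - \hat\beta_\lambda\big)^2 \right]$$

since $E[W_i] = 1$ implies that $E_W\left[ (P_n - P_n^W) \gamma(\hat s_m) \right] = 0$. The penalty (9) is well-defined if
and only if $\hat s_m^W$ is a.s. uniquely defined, i.e. $W_\lambda > 0$ for every $\lambda \in \Lambda_m$ a.s. This is why we modified
the definition of the weights in Algorithm 1, so that this problem does not occur.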

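Algorithm 2 (V-fold penalization for histograms).

1. Replace $\mathcal{M}_n$ by $\widehat{\mathcal{M}}_n = \{ m \in \mathcal{M}_n \text{ s.t. } \min_{\lambda \in \Lambda_m} \{ n \hat p_\lambda \} \ge 3 \}$.

2. Choose a constant $C \ge C_{W,\infty} = V - 1$.

3. For every $m \in \widehat{\mathcal{M}}_n$, choose a partition $(B_j)_{1 \le j \le V}$ of $\{1, \ldots, n\}$ such that

$$\forall \lambda \in \Lambda_m, \ \forall 1 \le j \le V, \qquad \left| \mathrm{Card}\big( B_j \cap \{ i \text{ s.t. } X_i \in I_\lambda \} \big) - \frac{n \hat p_\lambda}{V} \right| < 1 .$$

4. Compute the following resampling penalty for each $m \in \widehat{\mathcal{M}}_n$:

$$(10) \qquad \mathrm{pen}(m) = \mathrm{pen}_{\mathrm{VF}}(m) := \frac{C}{V} \sum_{j=1}^{V} \left[ P_n \gamma\big(\hat s_m^{(-j)}\big) - P_n^{(-j)} \gamma\big(\hat s_m^{(-j)}\big) \right] .$$

5. Choose $\hat m$ according to (6).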

At step 3, we choose a different partition for each model $m$. Our choice is consistent with the
proposal of Breiman et al. [BFOS84] (see also Burman [Bur90], Sect. 2) to stratify the data and
choose a partition which respects the strata. In the histogram case, natural strata are the sets
$\{ i \text{ s.t. } X_i \in I_\lambda \}$. In particular, steps 1 and 3 of Algorithm 2 ensure that $\min_{\lambda \in \Lambda_m} W_\lambda > 0$ for
every $m \in \widehat{\mathcal{M}}_n$, so that (10) is well-defined.
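One simple way to build a partition satisfying the constraint of step 3 is to deal the points of each histogram cell to the V blocks in round-robin order. The sketch below is an assumption about how this could be done, not the construction used by the author; `stratified_blocks` and its arguments are hypothetical names, and `cell_index[i]` is assumed to hold the index of the cell $I_\lambda$ containing $X_i$.

```python
import numpy as np

def stratified_blocks(cell_index, V, rng):
    """Return block_of[i] in {0, ..., V-1} such that, in every cell, each block
    receives floor or ceil of (cell size / V) points."""
    cell_index = np.asarray(cell_index)
    block_of = np.empty(len(cell_index), dtype=int)
    start = 0  # rotating offset, so that the overall block sizes stay balanced too
    for cell in np.unique(cell_index):
        members = rng.permutation(np.flatnonzero(cell_index == cell))
        block_of[members] = (start + np.arange(len(members))) % V
        start = (start + len(members)) % V
    return block_of
```

Because each cell contributes either the floor or the ceiling of $n \hat p_\lambda / V$ points to every block, the deviation from $n \hat p_\lambda / V$ is strictly smaller than 1. Combined with $n \hat p_\lambda \ge 3$ from step 1, every cell keeps at least one training point whichever block is left out.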

Other modifications of Algorithm 1 are possible. For instance, keep the same regular partition
$(B_j)_{1 \le j \le V}$ for all the models, and take

$$(11) \qquad \mathrm{pen}_{\mathrm{VF}}(m) = C \sum_{\lambda \in \Lambda_m} E_W\left[ \big(\hat p_\lambda + \hat p_\lambda^W\big) \big(\hat\beta_\lambda^W - \hat\beta_\lambda\big)^2 \,\Big|\, W_\lambda > 0 \right]$$

instead of (9). This is what we did in the simulations of Sect. 4, and a short theoretical study
of this method is done in Sect. 8.4.1 of [Arl07]. It confirms that the two algorithms should have
very similar performances in practical applications.

3.2. Expectations. We now come to the expectation of V-fold penalties, in the histogram
regression framework.

Proposition 2. Let $S_m$ be the model of histograms associated with some partition $(I_\lambda)_{\lambda \in \Lambda_m}$
and $\mathrm{pen} = \mathrm{pen}_{\mathrm{VF}}$ be defined as in Algorithm 2. Then, if $\min_{\lambda \in \Lambda_m} \{ n \hat p_\lambda \} \ge 3$,



(12) 



E' 



AeAr, 



2C 



V 



1 + F 



C j(penV) \ 
— 1 ".Pa / 



with E^ 



E^ 



'1 



^i<^h )l<i<n,\£A„ 



and 



npx-2 



> > 0. 

".PA 



Comparing (12) with (8), it appears that $\mathrm{pen}_{\mathrm{VF}}$ is an (almost) unbiased estimator of $\mathrm{pen}_{\mathrm{id}}$
when $C = V - 1$. Indeed, when $\min_{\lambda \in \Lambda_m} \{ n p_\lambda \}$ goes to infinity faster than some constant
times $\ln(n)$, so does $\min_{\lambda \in \Lambda_m} \{ n \hat p_\lambda \}$ with a large probability. Moreover, following the proof of
Lemma O we can show that 



E 



.(pcnV)^ ^ 
n,px "Pa>3 



< Kmin [ 1, (npA) 



npx^oo 







for some absolute constant $\kappa > 0$. This is consistent with the asymptotic computations of Burman
[Bur89]. The main novelty of Prop. 2 is that we have an explicit non-asymptotic upper bound on
the remainder term. This is crucial to derive oracle inequalities for Algorithm 2.

3.3. Oracle inequalities and asymptotic optimality. We are now in position to state the main
result of this section: V-fold penalties (Algorithm 2) satisfy a non-asymptotic oracle inequality
with a leading constant close to 1, on a large probability event. This implies the asymptotic
optimality of Algorithm 2 in terms of excess loss. For this, we assume the existence of some
non-negative constants $\alpha_{\mathcal{M}}, c_{\mathcal{M}}, c_{\mathrm{rich}}, \eta$ such that:

(P1) Polynomial complexity of $\mathcal{M}_n$: $\mathrm{Card}(\mathcal{M}_n) \le c_{\mathcal{M}} n^{\alpha_{\mathcal{M}}}$.
(P2) Richness of $\mathcal{M}_n$: $\exists m_0 \in \mathcal{M}_n$ s.t. $D_{m_0} \in [\sqrt{n}; c_{\mathrm{rich}} \sqrt{n}]$.
(P3) The constant $C$ is well chosen: $\eta (V - 1) \ge C \ge V - 1$.

Theorem 2. Assume that the $(X_i, Y_i)$'s satisfy the following:

(Ab) Bounded data: $\|Y_i\|_\infty \le A < \infty$.

(An) Noise-level bounded from below: $\sigma(X_i) \ge \sigma_{\min} > 0$ a.s.

(Ap) Polynomial decreasing of the bias: there exist $\beta_1 \ge \beta_2 > 0$ and $C^+, C^- > 0$ such that

$$C^- D_m^{-\beta_1} \le l(s, s_m) \le C^+ D_m^{-\beta_2} .$$

(Ar$_\ell^X$) Lower regularity of the partitions for $\mathcal{L}(X)$: $D_m \min_{\lambda \in \Lambda_m} p_\lambda \ge c_{r,\ell}^X > 0$.

Let $\hat m$ be the model chosen by Algorithm 2 (under restrictions (P1-3), with $\eta = 1$). Then,
there exists a constant $K_2$ and a sequence $\epsilon_n$ converging to zero at infinity such that

$$(13) \qquad l(s, \hat s_{\hat m}) \le (1 + \epsilon_n) \inf_{m \in \mathcal{M}_n} \{ l(s, \hat s_m) \}$$

with probability at least $1 - K_2 n^{-2}$. Moreover, we have the oracle inequality

$$(14) \qquad E\left[ l(s, \hat s_{\hat m}) \right] \le (1 + \epsilon_n) \, E\left[ \inf_{m \in \mathcal{M}_n} \{ l(s, \hat s_m) \} \right] + \frac{A^2 K_2}{n^2} .$$

The constant $K_2$ may depend on $V$ and constants in (Ab), (An), (Ap), (Ar$_\ell^X$) and (P1-3),
but not on $n$. The term $\epsilon_n$ is smaller than $\ln(n)^{-1/5}$ for instance; it can also be taken smaller
than $n^{-\delta}$ for any $0 < \delta < \delta_0(\beta_1, \beta_2)$, at the price of enlarging $K_2$.

We first make a few comments on our assumptions.

1. When assumption (P3) is satisfied with $\eta > 1$, the same result holds with a leading
constant $2\eta - 1 + \epsilon_n$ instead of $1 + \epsilon_n$ in (13) and (14).

2. In Thm. 2, we assume that $V$ is fixed when $n$ grows. A careful look at the proof shows that
we only need $V \le \ln(n)$ for $n$ large enough. With a little more work, we could go up to $V$ of
order $n^{\delta}$ for some $\delta > 0$ depending on the assumptions of Thm. 2, but we cannot handle
the leave-one-out case ($V = n$). This is probably a technical restriction, since a similar
result for several exchangeable weights (including leave-one-out) is proven in Chap. 6 of
[Arl07].

3. (Ab) and (An) are rather mild (and neither $A$ nor $\sigma_{\min}$ needs to be known by the
statistician). In particular, they allow quite general heteroscedastic noises. They can even
be relaxed, for instance thanks to results proven in Chap. 6 and Sect. 8.3 of [Arl07], allowing
the noise to vanish or to be unbounded.

4. (Ar$_\ell^X$) is satisfied for "almost regular" histograms when $X$ has a lower bounded density
w.r.t. Leb, as for instance in all the simulation experiments of Sect. 4.

5. The upper bound in (Ap) holds when $(I_\lambda)_{\lambda \in \Lambda_m}$ is regular and $s$ is $\alpha$-hölderian with $\alpha \in (0, 1]$.
The lower bound may seem more surprising, since it means that $s$ is not too well
approximated by the models $S_m$. However, it is classical to assume that $l(s, s_m) > 0$ for
every $m \in \mathcal{M}_n$ for proving the asymptotic optimality of Mallows' Cp (e.g. by Shibata
[Shi81], Li [Li87] and Birgé and Massart [BM06]). We here make a stronger assumption
because we need a non-asymptotic lower bound on the dimension of both the oracle and
selected models. The reason why it is not too restrictive is that non-constant $\alpha$-hölderian
functions satisfy (Ap) with

$$\beta_1 = k^{-1} + \alpha^{-1} - (k - 1) k^{-1} \alpha^{-1} \quad \text{and} \quad \beta_2 = 2 \alpha k^{-1} ,$$

when $(I_\lambda)_{\lambda \in \Lambda_m}$ is regular and $X$ has a lower-bounded density w.r.t. the Lebesgue measure
on $\mathcal{X} \subset \mathbb{R}^k$ (cf. Sect. 8.10 in [Arl07] for more details). Notice also that Stone [Sto85] and
Burman [Bur02] used the same assumption in the density estimation framework.

Theorem 2 has at least two major consequences. First, V-fold penalties provide an asymptotically
optimal model selection procedure, at least in the histogram regression framework, as soon
as $C \sim V - 1$. This should be compared to Thm. 1, where we proved that V-fold cross-validation
is suboptimal for a rather mild homoscedastic problem. Notice that a slight modification of the
proof of Thm. 2 shows that several other cross-validation like methods (even with the same
computational cost) have similar theoretical properties. We discuss this point in Sect. 5.2.

Second, Thm. 2 can handle several kinds of heteroscedastic noises, while Algorithm 2 does
not need any knowledge about $\sigma$, $\|Y\|_\infty$ or the smoothness of $s$. Even the tuning of $C$ and
$V$ can be made (at least at first order) without any information on the distribution $P$ of the
data. This shows that V-fold penalization is a naturally adaptive algorithm, as long as $\mathcal{M}_n$ allows
adaptation. The point here is that when $s$ belongs to some hölderian ball $\mathcal{H}(\alpha, R)$ (with $\alpha \in (0, 1]$
and $R > 0$), we can choose $\mathcal{M}_n$ as the family of regular histograms on $\mathcal{X} \subset \mathbb{R}^k$ to obtain such
an adaptivity result. Then, from Thm. 2, we can build an estimator adaptive to $(\alpha, R)$ in a
heteroscedastic framework (see [Arl07] for more details). If moreover the noise-level $\sigma$ satisfies
some regularity assumption, we can show that this estimator attains the minimax estimation
rate, up to some numerical constant, when $\alpha = k = 1$.

Notice also that a similar adaptation result could be obtained with V-fold cross-validation,
which also satisfies (13) and (14) with leading constants $K(V) > 1$, under similar assumptions.
The advance with V-fold penalization is that we have simultaneously the adaptivity property of
V-fold cross-validation, its mild computational cost (when $V$ is chosen small), and asymptotic
optimality (contrary to VFCV).

Finally, we would like to emphasize that building such estimators is not the final goal of
penVF. As a matter of fact, there are several procedures that are adaptive to the smoothness of
$s$ and the heteroscedasticity of the noise (e.g. by Efromovich and Pinsker [EP96] or Galtchouk
and Pergamenshchikov [GP05]), and they may have better performances than both VFCV and
penVF in this particular framework. Contrary to these ad hoc procedures, built specifically
for dealing with heteroscedasticity, VFCV and penVF are general-purpose devices. What our
theoretical results show is that they behave quite well in this framework, for which they were
not built in particular.

4. Simulation study. As an illustration of the results of the two previous sections, we 
compare the performances of VFCV, penVF (for several values of V) and Mallows' Cp on some 
simulated data. 



4.1. Experimental setup. We consider four experiments, called S1, S2, HSd1 and HSd2. Data
are generated according to

$$Y_i = s(X_i) + \sigma(X_i) \epsilon_i$$

with $X_i$ i.i.d. uniform on $\mathcal{X} = [0; 1]$ and $\epsilon_i \sim \mathcal{N}(0, 1)$ independent from $X_i$. The experiments
differ in the regression function $s$ (smooth for S, see Fig. 2; smooth with jumps for HS, see
Fig. 3), the noise type (homoscedastic for S1 and HSd1, heteroscedastic for S2 and HSd2) and
the number $n$ of data. Instances of data sets are given by Fig. 4 to 7. Their last difference lies
in the families of models. Defining

$$\forall k, k_1, k_2 \in \mathbb{N} \setminus \{0\}, \qquad (I_\lambda)_{\lambda \in \Lambda_k} = \left( \left[ \frac{j}{k} ; \frac{j+1}{k} \right) \right)_{0 \le j \le k-1}$$

and

$$(I_\lambda)_{\lambda \in \Lambda_{(k_1, k_2)}} = \left( \left[ \frac{j}{2 k_1} ; \frac{j+1}{2 k_1} \right) \right)_{0 \le j \le k_1 - 1} \cup \left( \left[ \frac{1}{2} + \frac{j}{2 k_2} ; \frac{1}{2} + \frac{j+1}{2 k_2} \right) \right)_{0 \le j \le k_2 - 1} ,$$

the four model families are indexed by $m \in \mathcal{M}_n \subset (\mathbb{N} \setminus \{0\}) \cup (\mathbb{N} \setminus \{0\})^2$:

S1 regular histograms with $1 \le D \le n (\ln(n))^{-1}$ pieces, i.e.

$$\mathcal{M}_n = \left\{ 1, \ldots, \left\lfloor \frac{n}{\ln(n)} \right\rfloor \right\} .$$

S2 histograms regular on $[0; 1/2]$ (resp. on $[1/2; 1]$), with $D_1$ (resp. $D_2$) pieces, $1 \le D_1, D_2 \le n (2 \ln(n))^{-1}$.
The model of constant functions is added to $\mathcal{M}_n$, i.e.

$$\mathcal{M}_n = \{ 1 \} \cup \left\{ 1, \ldots, \left\lfloor \frac{n}{2 \ln(n)} \right\rfloor \right\}^2 .$$

HSd1 dyadic regular histograms with $2^k$ pieces, $0 \le k \le \ln_2(n) - 1$, i.e.

$$\mathcal{M}_n = \{ 2^k \text{ s.t. } 0 \le k \le \ln_2(n) - 1 \} .$$

HSd2 dyadic regular histograms with bin sizes $2^{-k_1 - 1}$ and $2^{-k_2 - 1}$, $0 \le k_1, k_2 \le \ln_2(n) - 2$ (dyadic
version of S2). The model of constant functions is added to $\mathcal{M}_n$, i.e.

$$\mathcal{M}_n = \{ 1 \} \cup \{ 2^k \text{ s.t. } 0 \le k \le \ln_2(n) - 2 \}^2 .$$

Fig 2. $s(x) = \sin(\pi x)$

Fig 3. $s(x) = \mathrm{HeaviSine}(x)$ (see fUTMl)

Fig 4. S1: $s(x) = \sin(\pi x)$, $\sigma = 1$, $n = 200$

Fig 5. S2: $s(x) = \sin(\pi x)$, $\sigma(x) = x$, $n = 200$

Notice that we choose models that can approximately fit the true shape of $\sigma(x)$ in experiments
S2 and HSd2. This choice makes the oracle model even more efficient, hence the model selection
problem more challenging.
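The data-generating mechanism and two of the model index sets above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names (`simulate`, `regular_dims`, `dyadic_dims`) are hypothetical, the upper limits are rounded down to integers, and only the smooth regression function $\sin(\pi x)$ is shown since the HeaviSine function is not defined in this section.

```python
import math
import numpy as np

def simulate(n, s, sigma, rng):
    """Draw (X_i, Y_i) with X_i uniform on [0, 1] and Y_i = s(X_i) + sigma(X_i) * eps_i."""
    x = rng.uniform(0.0, 1.0, size=n)
    eps = rng.standard_normal(n)          # standard Gaussian noise, independent of x
    return x, s(x) + sigma(x) * eps

def regular_dims(n):
    """S1-like index set {1, ..., floor(n / ln n)}."""
    return list(range(1, math.floor(n / math.log(n)) + 1))

def dyadic_dims(n):
    """HSd1-like index set {2^k : 0 <= k <= log2(n) - 1}."""
    k_max = math.floor(math.log2(n)) - 1
    return [2 ** k for k in range(k_max + 1)]

rng = np.random.default_rng(0)
# S1-like data: homoscedastic noise with sigma = 1
x1, y1 = simulate(200, lambda x: np.sin(np.pi * x), lambda x: np.ones_like(x), rng)
# S2-like data: heteroscedastic noise with sigma(x) = x
x2, y2 = simulate(200, lambda x: np.sin(np.pi * x), lambda x: x, rng)
```

For n = 200 the regular family stops at 37 pieces, and for n = 2048 the dyadic family stops at 1024 pieces, so even the largest models keep a few observations per bin on average.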

We compare the following algorithms:

VFCV Classical V-fold cross-validation, defined by (P), with $V \in \{2, 5, 10, 20\}$.

LOO Classical leave-one-out (i.e. VFCV with $V = n$).

penVF V-fold penalty, with $V \in \{2, 5, 10, 20\}$ and $C = C_{W,\infty} = V - 1$. The partition $(B_j)$ is chosen
once, as in Algorithm 1, and $\mathrm{pen}_{\mathrm{VF}}$ is defined by (11). In practice, this is almost the same
as Algorithm 2.

penLoo V-fold penalty, with $V = n$ and $C = C_{W,\infty} = n - 1$.

Mal Mallows' Cp penalty: $\mathrm{pen}(m) = 2 \hat\sigma^2 D_m n^{-1}$, where $\hat\sigma^2 = 2 n^{-1} d^2(Y_{1 \ldots n}, S_{n/2})$ is the classical
variance estimator ($d$ being the Euclidean distance on $\mathbb{R}^n$, $S_{n/2}$ any vector space of
dimension $n/2$ of $\mathbb{R}^n$ and $Y_{1 \ldots n} = (Y_1, \ldots, Y_n) \in \mathbb{R}^n$). The non-asymptotic validity of this
procedure for model selection in homoscedastic regression has been assessed by Baraud
[Bar00].

E[pen_id] Ideal deterministic penalty: $\mathrm{pen}(m) = E[\mathrm{pen}_{\mathrm{id}}(m)]$. We use it as a witness of what is a
good performance in each experiment.

Fig 6. HSd1: HeaviSine, $\sigma = 1$, $n = 2048$

Fig 7. HSd2: HeaviSine, $\sigma(x) = x$, $n = 2048$

For each penalization procedure, we also consider the same penalty multiplied by 5/4 (denoted
by a + symbol added after its shortened name). This is intended to test for overpenalization (the
choice of the factor 5/4 being arbitrary and certainly not optimal).
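As an illustration of the Mallows' Cp entry above, the variance estimator $2 n^{-1} d^2(Y_{1 \ldots n}, S_{n/2})$ can be computed in closed form for one particular choice of $S_{n/2}$. The sketch below assumes, purely for illustration, that $S_{n/2}$ is the space of vectors that are constant on consecutive pairs of observations (after sorting by $X$); the source only requires some vector space of dimension $n/2$, so this choice and the function names are hypothetical.

```python
import numpy as np

def pairwise_variance_estimate(y_sorted):
    """sigma_hat^2 = (2/n) * d^2(Y, S_{n/2}), with S_{n/2} the vectors constant on
    consecutive pairs (y_0, y_1), (y_2, y_3), ...; n must be even."""
    y = np.asarray(y_sorted, dtype=float)
    n = len(y)
    if n % 2:
        raise ValueError("n must be even")
    diffs = y[0::2] - y[1::2]
    # Projecting a pair (a, b) on the constants gives squared distance (a - b)^2 / 2.
    d2 = np.sum(diffs ** 2) / 2.0
    return 2.0 * d2 / n

def mallows_penalty(sigma2_hat, D_m, n):
    """Mallows' Cp penalty 2 * sigma_hat^2 * D_m / n."""
    return 2.0 * sigma2_hat * D_m / n
```

Multiplying the returned penalty by 5/4 gives the overpenalized variant denoted Mal+.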

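In each experiment, for each simulated data set, we replace $\mathcal{M}_n$ by $\widehat{\mathcal{M}}_n$ as in step 1 of
Algorithm 2. Then, we compute the least-squares estimators $\hat s_m$ for each $m \in \widehat{\mathcal{M}}_n$. Finally, we
select $\hat m \in \widehat{\mathcal{M}}_n$ using each algorithm and compute its true excess loss $l(s, \hat s_{\hat m})$ (and the excess
loss $l(s, \hat s_m)$ for every $m \in \mathcal{M}_n$). We simulate $N = 1000$ data sets, from which we can estimate
the model selection performance of each procedure, through the two following benchmarks:

$$C_{\mathrm{or}} = \frac{E\left[ l(s, \hat s_{\hat m}) \right]}{E\left[ \inf_{m \in \mathcal{M}_n} l(s, \hat s_m) \right]} \quad \text{and} \quad C_{\mathrm{path\text{-}or}} = E\left[ \frac{l(s, \hat s_{\hat m})}{\inf_{m \in \mathcal{M}_n} l(s, \hat s_m)} \right] .$$

Basically, $C_{\mathrm{or}}$ is the constant that should appear in an oracle inequality like (14), and $C_{\mathrm{path\text{-}or}}$
corresponds to a pathwise oracle inequality like (13). As $C_{\mathrm{or}}$ and $C_{\mathrm{path\text{-}or}}$ approximately give
the same rankings between algorithms, we only report $C_{\mathrm{or}}$ in Tab. 1.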

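Both benchmarks are straightforward to estimate from simulated replications by replacing expectations with empirical means. The following sketch assumes that the excess losses have already been computed and stored in arrays; the function name `oracle_benchmarks` and the array layout are hypothetical.

```python
import numpy as np

def oracle_benchmarks(selected_loss, all_losses):
    """Monte Carlo estimates of C_or and C_path-or.

    selected_loss: shape (N,), excess loss of the selected model in each replication.
    all_losses:    shape (N, M), excess loss of every candidate model in each replication.
    """
    selected_loss = np.asarray(selected_loss, dtype=float)
    oracle_loss = np.asarray(all_losses, dtype=float).min(axis=1)
    c_or = selected_loss.mean() / oracle_loss.mean()      # ratio of expectations
    c_path_or = np.mean(selected_loss / oracle_loss)      # expectation of the ratio
    return c_or, c_path_or
```

For instance, with two replications where the selected model has loss 2 both times and the oracle losses are 1 and 2, the function returns 2 / 1.5 for the first benchmark and 1.5 for the second.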
4.2. Results and comments. First of all, our experiments show the interest of both penVF
and VFCV in several difficult frameworks, with relatively small sample sizes. Although they cannot
compete with simple procedures such as Mallows' Cp from the computational viewpoint, they are
much more efficient when the noise is heteroscedastic (S2 and HSd2). In these hard frameworks,
the performances of penVF and VFCV are comparable to those of the "ideal deterministic
penalty" $E[\mathrm{pen}_{\mathrm{id}}]$. On the other hand, they perform slightly worse than Mallows' Cp for the easier
problems (S1 and HSd1), which we interpret as the unavoidable price for robustness.

Secondly, in the four experiments, the best procedures are always the overpenalizing ones:
many of them even beat the perfectly unbiased $E[\mathrm{pen}_{\mathrm{id}}]$, showing the crucial need to overpenalize.
This is mainly due to the small sample size compared to the high noise-level, since it is not
the case when $\sigma$ is smaller, and less obvious when $n$ is larger (see respectively experiments S0.1
and S1000 in Chap. 5 of [Arl07]). We would like to insist on the importance of this phenomenon,
which is seldom mentioned because it vanishes in the asymptotic framework, and it is quite
hard to find from theoretical results.

We can now come back to the discussion of Sect. 2.3 on the choice of $V$ for VFCV, which
is enlightened by the results of Tab. 1. In the first three experiments, and more clearly in
HSd1, $V = 2$ has comparable or better performances than $V \in \{5, 10, 20, n\}$. This is highly
counterintuitive, unless we consider the need for overpenalization in those experiments where the
signal-to-noise ratio is quite low. It appears that the variability issue is less important in those
three cases. This is not because the variance of $\mathrm{crit}_{\mathrm{VFCV}}$ is negligible compared to its bias, but
mainly because its dependence on $V$ is only mild. Hence, whatever $V$, it has to be compensated
by overpenalizing. On the contrary, the best choices are $V = 20$ and $V = n$ in experiment HSd2,
where overpenalization seems to be less needed. The main conclusion here should be that one
really has to take into account both overpenalization and variance for choosing an optimal $V$.
The largest $V$ is not always the best one, so that a larger computation time does not always
improve the accuracy. The main difficulty here is that it does not seem straightforward to choose
$V$ from the data only.

Finally, let us compare the performances of V-fold cross-validation and V-fold penalization in
Tab. 1. At first glance, it seems that penVF with $V \le 20$ performs worse than VFCV in the first
three experiments, and not clearly better in the last one. The point is that it matches exactly
with the experiments for which overpenalization is crucial. But looking at the performance of
penVF+, we have evidence for the advantage conferred to penVF by its flexibility. In three out of
four experiments, penVF+ with any $V \in \{5, 10, 20, n\}$ does better than VFCV with any choice
of $V$; and it is almost the case for HSd1. This comes from the overpenalizing ability of V-fold
penalization, which is crucial in such non-asymptotic situations.

Moreover, choosing the optimal $V$ for penVF or penVF+ is much simpler than for VFCV: it
is always the largest $V$. Note that $V = n$ does not always perform significantly better than
$V = 20$ or $V = 10$, which can be considered as almost optimal choices. For the practical user,
the choice of $V$ thus reduces to a trade-off between computational complexity and performance
(the latter being governed by the variability of the V-fold penalties). Then, once $V$ is chosen, $C$
has to be taken equal to $(V - 1)$ times the overpenalization factor (and estimating it from the
data remains an open question).

Table 1

Accuracy indexes $C_{\mathrm{or}}$ for each algorithm in four experiments, ± a rough estimate of uncertainty of the value
reported (i.e. the empirical standard deviation divided by $\sqrt{N}$). In each column, the more accurate algorithms
(taking the uncertainty into account; E[pen_id] and E[pen_id]+ are not taken into account there) are bolded.

Experiment        S1              S2              HSd1              HSd2
s                 sin(π·)         sin(π·)         HeaviSine         HeaviSine
σ(x)              1               x               1                 x
n (sample size)   200             200             2048              2048
M_n               regular         2 bin sizes     dyadic, regular   dyadic, 2 bin sizes

E[pen_id]         1.919 ± 0.03    2.296 ± 0.05    1.028 ± 0.004     1.102 ± 0.004
E[pen_id]+        1.792 ± 0.03    2.028 ± 0.04    1.003 ± 0.003     1.089 ± 0.004
Mal               1.928 ± 0.04    3.687 ± 0.07    1.015 ± 0.003     1.373 ± 0.010
Mal+              1.800 ± 0.03    3.173 ± 0.07    1.002 ± 0.003     1.411 ± 0.008

2-FCV             2.078 ± 0.04    2.542 ± 0.05    1.002 ± 0.003     1.184 ± 0.004
5-FCV             2.137 ± 0.04    2.582 ± 0.06    1.014 ± 0.003     1.115 ± 0.005
10-FCV            2.097 ± 0.05    2.603 ± 0.06    1.021 ± 0.003     1.109 ± 0.004
20-FCV            2.088 ± 0.04    2.578 ± 0.06    1.029 ± 0.004     1.105 ± 0.004
LOO               2.077 ± 0.04    2.593 ± 0.06    1.034 ± 0.004     1.105 ± 0.004

pen2-F            2.578 ± 0.06    3.061 ± 0.07    1.038 ± 0.004     1.103 ± 0.005
pen5-F            2.219 ± 0.05    2.750 ± 0.06    1.037 ± 0.004     1.104 ± 0.004
pen10-F           2.121 ± 0.05    2.653 ± 0.06    1.034 ± 0.004     1.104 ± 0.004
pen20-F           2.085 ± 0.04    2.639 ± 0.06    1.034 ± 0.004     1.105 ± 0.004
penLoo            2.080 ± 0.05    2.593 ± 0.06    1.034 ± 0.004     1.105 ± 0.004

pen2-F+           2.175 ± 0.05    2.748 ± 0.06    1.011 ± 0.003     1.106 ± 0.004
pen5-F+           1.913 ± 0.03    2.378 ± 0.05    1.006 ± 0.003     1.102 ± 0.004
pen10-F+          1.872 ± 0.03    2.285 ± 0.05    1.005 ± 0.003     1.098 ± 0.004
pen20-F+          1.898 ± 0.04    2.254 ± 0.05    1.004 ± 0.003     1.098 ± 0.004
penLoo+           1.844 ± 0.03    2.215 ± 0.05    1.004 ± 0.003     1.096 ± 0.004

We conclude this section with some additional remarks, concerning some particular points of
our simulation study.

• We also performed Mallows' Cp (and its overpenalized version Mal+) with the true mean
variance $E[\sigma^2(X)]$ instead of $\hat\sigma^2$ (which would not be possible on a real data set). It
gave worse performance for all experiments but S2, in which $C_{\mathrm{or}}(\mathrm{Mal}) = 2.657 \pm 0.06$ and
$C_{\mathrm{or}}(\mathrm{Mal+}) = 2.437 \pm 0.05$. This shows that overpenalization is really crucial in experiment
S2, even more than the shape of the penalty itself. But once we overpenalize, penVF+
remains significantly better than Mallows' Cp ($\mathrm{crit}_{\mathrm{VFCV}}$ being too variable for small $V$
to do better than Mallows). The ability to overpenalize with penVF while keeping the
variability low (i.e. $V$ large) thus appears to be crucial in this case. In addition, it can be
proved that Mallows' Cp penalty (and, more generally, any penalty of the form $K D_m$) leads
to suboptimal model selection in some heteroscedastic framework. See [Arl07], Chap. 4.
This should be compared to Thm. 2, which can be applied in that framework.

• In experiment HSd1, 2-fold cross-validation appears to be among the best model selection
procedures overall. This should be linked with the fact that $\mathcal{M}_n$ only consists of histograms
on dyadic partitions of $[0, 1]$, so that the assumptions of Thm. 1 are not fulfilled. More
precisely, our computations may show that the model which minimizes $E[\mathrm{crit}_{\mathrm{VFCV}}(m)]$ with
$V = 2$ is the oracle model for arbitrarily large values of $n$. This emphasizes the fact that
VFCV is not universally suboptimal for model selection for prediction. It is only unable to
make the right choice among estimators whose excess losses are within a constant factor
smaller than some $K(V) > 1$.

• Eight additional experiments are reported in Chap. 5 of [Arl07], showing similar results
with various $n$, $\sigma$ and $s$ (the assumptions of Thm. 2 not being always satisfied). Notice
that overpenalization is not always necessary, in particular when the signal-to-noise ratio
is larger. In such situations, $V = 20$ or $V = n$ is generally optimal for VFCV.

5. Discussion. 

5.1. V-fold cross-validation vs. V-fold penalties. Time has come for us to give an accurate
answer to this practical (but quite hard) question: how to use V-fold?

Firstly, the classical V-fold cross-validation is biased and asymptotically suboptimal for
prediction in some "easy framework" (i.e. with a smooth regression function and a homoscedastic
Gaussian noise). It thus has to be corrected, and we suggest a V-fold penalization algorithm that
provides such a correction. This algorithm is asymptotically optimal in theory, quite efficient on
some simulated data, and has the same computational cost as VFCV.

Secondly, a non-asymptotic phenomenon is likely to arise, which makes the problem harder: when
the sample size is small and the noise-level large, overpenalizing procedures are more efficient
than unbiased ones. Then, our V-fold penalization method allows one to choose an overpenalizing
factor, whereas VFCV imposes it (through $V$) and a corrected VFCV forbids it. This flexibility
is the main reason why we suggest using penVF instead of VFCV or Burman's corrected VFCV.
Otherwise, $V$ has to be chosen very carefully, taking into account variability, bias and the possible
need for some bias.

We shall now explain how to use V-fold penalties. They depend on two tuning parameters: the
number $V$ of folds and the overpenalization factor $C/(V - 1)$. The choice of $V$ depends on the
trade-off between variability and computational complexity. If the latter does not matter,
the optimal choice is close to $V = n$ (at least for least-squares regression). Otherwise, the choice
has to be made by the final user. We refer to asymptotic computations of Burman [Bur89, Bur90]
(in linear regression) and the recent work of Celisse and Robin [CR08] (in density estimation)
for quantitative measures of variability according to $V$. Further research in that direction would
be very useful for practical use of V-fold model selection criteria.

The question of choosing the overpenalization factor is probably harder to solve. According to
our simulation study, the optimal one depends at least on the sample size, the noise level and the
smoothness of the regression function. Since the first criterion is that the penalty almost never
underestimates the ideal one, a wise choice of $C$ depends on the fluctuations of both the V-fold
penalty and the ideal penalty. We thus need a better understanding of the variability of penVF.
Another idea would be to replace the conditional expectation in (9) by a quantile, in order
to build a simultaneous confidence region for the prediction errors $(P\gamma(\hat s_m))_{m \in \mathcal{M}_n}$. Then, we
could deduce a confidence set, to which the oracle model should belong. Defining $\hat m$ as the most
parsimonious model in this confidence set, we would have done the work of overpenalization by
choosing the probability coverage of the confidence region. We refer to [Arl07] (Sect. 6.6 and
11.3.3) for further discussions about overpenalization.

5.2. Other cross-validation methods. In this paper, we focused on VFCV and penVF, among
many other cross-validation like methods: hold-out, repeated learning-testing methods [BFOS84],
leave-p-out, etc. However, it follows from our proofs that the asymptotic performances of these
methods mainly depend on their bias, which is itself a function of the ratio between the size of
the learning set and the sample size. It is thus possible to have asymptotic optimality with any
complexity cost, even without using penVF.

Let us fix for instance the computational complexity to the one of 2-fold cross-validation. We
may use 2-fold cross-validation, Burman's corrected 2-fold CV, 2-fold penalization or repeated
learning-testing methods (with 2 splits of the data and a learning set of size equivalent to the
sample size $n$). Asymptotically, the first one is suboptimal (Thm. 1), while the three other
ones are optimal (Thm. 2 and the proof of Thm. 1). We have already seen in Sect. 5.1 that
Burman's corrected 2-fold CV cannot overpenalize when needed, which can be a serious drawback
in non-asymptotic situations. Repeated learning-testing does not have this drawback, since it is
possible to overpenalize within any factor $C > 1$ by choosing a learning set of size $\sim n/(2C - 1)$.

However, there remains a strong argument in favour of 2-fold penalization. When $C$ has to
be taken close to 1 (which is the asymptotic situation), repeated learning-testing requires the
size of the learning set to be very close to $n$. Hence, if we can only make two splits, most of the
data remains in both learning sets. This makes the final criterion highly variable, since it strongly
depends on the few data which belong to the union of the two test sets. On the contrary,
with 2-fold penalization (as well as 2-fold cross-validation and its corrected version), each data
point is used once for learning and once for testing.

Finally, it seems to us that V-fold penalization should be preferred, because of its versatility:
it is asymptotically optimal, quite flexible (for non-asymptotic situations) and makes use of all
the data for both learning and testing.
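The relation between the size of the learning set and the overpenalization factor quoted above can be made explicit. The sketch below simply inverts the stated relation (learning set of size $n/(2C - 1)$) and, as an assumption made for illustration, applies it to V-fold cross-validation, whose learning sets have size $n(V-1)/V$; the function names are hypothetical.

```python
from fractions import Fraction

def overpenalization_factor(n_learn, n):
    """Factor C solving n_learn = n / (2C - 1), i.e. C = (1 + n / n_learn) / 2."""
    return (1 + Fraction(n, n_learn)) / 2

def vfcv_overpenalization(V):
    """Same factor for a learning set of relative size (V - 1) / V, as in V-fold CV."""
    return overpenalization_factor(V - 1, V)

for V in (2, 5, 10, 20):
    print(V, vfcv_overpenalization(V))   # equals 1 + 1 / (2 (V - 1))
```

Under this reading the factor is 3/2 for V = 2 and decreases to 1 as V grows, which is why the overpenalization imposed by VFCV cannot be tuned independently of V, whereas the constant C of penVF can.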

5.3. Prediction in other frameworks. In order to make theoretical computations feasible, we 
restricted ourselves to the histogram regression framework in this article. Of course, this is only 
a first step towards a more general study of V-fold methods for model selection. Although all 
our proofs strongly rely on some particular features of histograms (in particular for computing 
expectations), we conjecture than most of our conclusions stay valid much more generally. The 
main argument supporting this claim is that part of our concentration inequalities are still valid 
in a general framework, including bounded regression and binary classification. Accurate state- 
ments and proofs are to be found in Chap. 7 of |Arl07| . In addition, penVF is built upon the 
same general heuristics as VFCV, and was never designed particularly for the heteroscedastic 
histogram regression problem. Hence, it should have at least the same robustness and adaptivity 
properties as VFCV, while its flexibility should allow better performance in terms of multiplica- 
tive constants (which may be crucial, when the sample size is small). 

Let us now point out some expected changes in our analysis in the general case. First, the no- 
overpenalization constant Cw,oo may not stay equal to V — 1. Although me mentioned an asymp- 
totic theoretical argument, it may break down when one considers models with a large number 
of parameters (that is, dependent from n). If this occurs, we suggest to use a data-dependent 
procedure for estimating Cw,oo, based upon the so-called "slope heuristics" |BM06tlAM08| . Basi- 
cally, it states that Cw,oo is twice the constant under which blows up dramatically. We refer 
to the above papers for a detailed statement of this algorithm, as well as theoretical insights. 

Second, the influence of V on variability may also be quite different. For instance, in clas- 
sification, it is often noticed that the leave-one-out is much more variable than VFCV with 
smaller values of V jHTFOlj . According to Molinaro, Simon and Pfeiffer [MSP05], this seems 
to disappear when the algorithm producing Sm is stable. In addition, in the density estimation 
framework, Celisse and Robin |CR08j also report that the variance of critypcv increases for large 
V. We believe that an extensive study of this variability issue in all those frameworks should 
be made, considering that it is a crucial point for choosing V for VFCV. It would also be quite 
interesting to determine whether the variability of penVF depends on V in the same way or not. 
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5.4. Consistency. We focused in this article on prediction, but one often uses model selection 
for identification. In this framework, one assumes that s £ Sm* (and maybe also to some more 
complex models), and the goal of a model selection procedure is to catch m* as often as possible, 
whatever the prediction risk of Sm*- Asymptotic optimality there become consistency, i.e. 

F(m = m* ) > 1 . 

There is a huge amount of papers about model selection for identification; we refer to the 
introduction of papers by Yang [YanOGj lYanOTj for references about the consistency of cross- 
validation in the regression and classification settings. 

The main point for consistency is that overpenalization is needed, even from the asymptotic
viewpoint. This is the main reason why BIC is roughly the AIC criterion multiplied by a constant
times ln(n). See also Aerts, Claeskens and Hart [ACH99] about this question. Our penalization
interpretation of VFCV (and more generally, of any cross-validation-like method) then sheds light
on several theoretical and empirical results about the consistency issue.

With VFCV, the overpenalization factor is bounded from above by 3/2 (which corresponds
to V = 2). Hence, V-fold cross-validation may be inconsistent in general for any V (although
it can sometimes be used, when one compares sufficiently different models, see Yang [Yan07]).
Moreover, the best choice is often V = 2, as remarked by Zhang [Zha93], Dietterich [Die98] and
Alpaydin [Alp99]. On the contrary, V-fold penalties could work, by choosing
$C \propto (V - 1) \ln(n)$ (for instance). We conjecture that such a method would be consistent,
whatever V.
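
As a quick numerical illustration, the sketch below tabulates the overpenalization factor of VFCV for a few values of V. It assumes the closed form 1 + 1/(2(V - 1)) for this factor, which is our reading of the analysis in the main text (it is not restated in this section) and which matches the value 3/2 quoted above for V = 2. The function name is ours.

```python
def vfcv_overpenalization(V: int) -> float:
    """Overpenalization factor of V-fold cross-validation.

    Assumption of this sketch: the factor equals 1 + 1 / (2 (V - 1)),
    so it is 3/2 for V = 2 and decreases to 1 as V grows.
    """
    if V < 2:
        raise ValueError("V-fold cross-validation needs V >= 2")
    return 1.0 + 1.0 / (2.0 * (V - 1))


if __name__ == "__main__":
    for V in (2, 5, 10, 20, 200):
        print(f"V = {V:3d}: factor = {vfcv_overpenalization(V):.4f}")
```

Under this assumption the factor never exceeds 3/2, whereas consistency calls for a factor of order ln(n); this is the gap that the choice $C \propto (V - 1)\ln(n)$ is meant to fill.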

More generally, it has been noticed several times that the consistency of cross-validation
requires the size of the learning set to be negligible with respect to the sample size. In the
linear regression framework, this has been shown by Shao [Sha93, Sha97]. In the classification
setting, this is called the "cross-validation paradox" by Yang [Yan06]. With penVF, we believe
that we may have proposed a way of solving this paradox, by allowing one to choose the
overpenalization factor independently from the size of the learning set.

APPENDIX A: PROBABILISTIC TOOLS 

In this section, we give some probability theory results that we need to prove our main result,
and that may be of independent interest. In the rest of the paper, for any $a, b \in \mathbb{R}$,
we denote by $a \wedge b$ the minimum of a and b, and by $a \vee b$ the maximum of a and b.



A.1. Expectations of inverses of binomials. For any non-negative random variable Z, define

$e^+_{\mathcal{L}(Z)} := \mathbb{E}[Z] \, \mathbb{E}\left[ Z^{-1} \mid Z > 0 \right]$ .

Non-asymptotic bounds on this quantity when Z has a binomial distribution are required in the
proof of Prop. 1, which is at the core of our main results. Former results concerning
$e^+_{\mathcal{B}(n,p)}$ can be found in papers by Lew [Lew76] (for general Z) or Znidaric [Zni05]
(for the binomial case), but they are either asymptotic or not accurate enough. The following
lemma solves this issue.
Lemma 3. For any $n \in \mathbb{N} \setminus \{0\}$ and $p \in (0; 1]$, $\mathcal{B}(n,p)$ denotes
the binomial distribution with parameters (n, p), $\kappa_3 = 5.1$ and $\kappa_4 = 3.2$. Then, if
$np \geq 1$,

(15)    $\kappa_4 \wedge \left( 1 + \kappa_3 (np)^{-1/4} \right) \geq e^+_{\mathcal{B}(n,p)} \geq 1 - e^{-np}$ .

In particular, $e^+_{\mathcal{B}(n,p)} \to 1$ when $np \to \infty$, which can be derived from [Zni05].
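
The bounds of Lemma 3 can be checked numerically for small n by direct summation over the binomial distribution. The sketch below is only an illustration under our reading of the notation: it takes $e^+$ to be $\mathbb{E}[Z]\,\mathbb{E}[Z^{-1} \mid Z > 0]$ and $e^0$ to be $\mathbb{E}[Z]\,\mathbb{E}[Z^{-1} \mathbf{1}_{Z>0}]$, and uses the exponent -1/4 in (15). The function names are ours, and a handful of parameter values proves nothing in general.

```python
import math

KAPPA_3, KAPPA_4 = 5.1, 3.2  # constants quoted in Lemma 3


def binomial_pmf(n: int, p: float, k: int) -> float:
    """P(Z = k) for Z ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def e_zero(n: int, p: float) -> float:
    """E[Z] * E[Z^{-1} 1_{Z>0}], computed by direct summation."""
    return n * p * sum(binomial_pmf(n, p, k) / k for k in range(1, n + 1))


def e_plus(n: int, p: float) -> float:
    """E[Z] * E[Z^{-1} | Z > 0], i.e. e_zero divided by P(Z > 0)."""
    return e_zero(n, p) / (1.0 - (1.0 - p) ** n)


def bounds_15(n: int, p: float) -> tuple[float, float]:
    """Lower and upper bounds of (15), as read in this sketch."""
    np_ = n * p
    lower = 1.0 - math.exp(-np_)
    upper = min(KAPPA_4, 1.0 + KAPPA_3 * np_ ** -0.25)
    return lower, upper


if __name__ == "__main__":
    for n, p in [(10, 0.1), (50, 0.2), (200, 0.5)]:
        lower, upper = bounds_15(n, p)
        print(f"n={n}, p={p}: {lower:.4f} <= {e_plus(n, p):.4f} <= {upper:.4f}")
```

Direct summation is only practical for moderate n (a few hundred), since the binomial coefficients eventually overflow floating-point arithmetic.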



A.2. Concentration of inverses of multinomials. Let
$(X_\lambda)_{\lambda \in \Lambda_m} \sim \mathcal{M}(n; (p_\lambda)_{\lambda \in \Lambda_m})$ be
a multinomial random vector, $(a_\lambda)_{\lambda \in \Lambda_m}$ a family of non-negative real
numbers, and define for every $T \in (0, 1]$

$Z_{m,T} := \sum_{\lambda \in \Lambda_m} a_\lambda \min\left( T, X_\lambda^{-1} \right)$ .

Such a quantity naturally appears in our setting, mainly because of the randomness of the
design. Unfortunately, classical concentration inequalities for sums of random variables cannot
be applied to $Z_{m,T}$ because the $X_\lambda$ are not independent. Using that they are
negatively associated [JDP83], we can use the Cramér-Chernoff method [DR98] to obtain the
following lemma. Its complete proof can be found in Sect. 8.8 of [Arl07].

Lemma 4. Assume that mmx(=\^ {npx} > Bn > 1 and T G (0,1]. Define ci = 0.184, 
C2 = 0.28, C3 = 9.6, C4 = 0.09, C5 = 10.5, and for every t > 0, ipi{t) = max(t, ije" '"^^(^'i) . 

1. Lower deviations: for every x > 0, with probability at least 1 — e~^ , 



(16) E[Z„,,]-Z.^^,<^^1^^J2 — 

Ate 



3V2 



AeA„ 



{npx)' 



■\/4Djn exp{-ciB„ 



2. Upper deviations: for every x > 0, with probability at least 1 — e ^ , 



C2 



AeA^ 



( 

\npx 



(17) 



+ 



AeA„ 



\npx 



C4.B„ 



+ x) X C3 V 



^-->eA,„{ft}V^A.A„(;tr 



A.3. Moment inequalities for some U-statistics. There are several papers about concentration
or moment inequalities for U-statistics, e.g. [GLZ00, Ada05]. It appears that our main
results strongly rely on concentration properties for a particular kind of U-statistics of order 2,
which are given by the following lemma. It can be derived either from the aforementioned papers,
or from [BBLM05], as we did in Sect. 8.9 of [Arl07].
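Lemma 5. Let $(a_\lambda)_{\lambda \in \Lambda_m}$ and $(b_\lambda)_{\lambda \in \Lambda_m}$ be two
families of real numbers, $(r_\lambda)_{\lambda \in \Lambda_m}$ a family of integers. For all
$\lambda \in \Lambda_m$, let $(\xi_{\lambda,i})_{1 \leq i \leq r_\lambda}$ be independent centered
random variables admitting 2q-th moments $m_{2q,\lambda,i}$ for some $q \geq 2$. We define
$S_{\lambda,1}$, $S_{\lambda,2}$ and Z as follows:

(18)    $Z = \sum_{\lambda \in \Lambda_m} \left( a_\lambda S_{\lambda,2} + b_\lambda S_{\lambda,1}^2 \right)$ with $S_{\lambda,1} = \sum_{i=1}^{r_\lambda} \xi_{\lambda,i}$ and $S_{\lambda,2} = \sum_{i=1}^{r_\lambda} \xi_{\lambda,i}^2$ .

Then, there is a numerical constant $\kappa \leq 1.271$ such that, for every $q \geq 2$,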



\Z-¥.[Z]\\<4^^ 



\ 



+ 8V2Kq 



AeA„ 



i=l 



\ 



E "5 E 



APPENDIX B: PROOFS

B.1. Notations. Before starting the proofs, we introduce some notations or conventions:

• The letter L will be used to denote "some positive numerical constant, possibly different
from one place to another". In the same way, a constant which depends on $c_1, \ldots, c_k$ will
be denoted $L_{c_1,\ldots,c_k}$, and if (A) denotes a set of assumptions, $L_{(A)}$ will be any
constant that depends on the parameters appearing in (A).

• For any non-negative random variable Z, we define $e^+_{\mathcal{L}(Z)} := \mathbb{E}[Z] \, \mathbb{E}\left[ Z^{-1} \mid Z > 0 \right]$.

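• For every model $m \in \mathcal{M}_n$, and every $j \in \{1, \ldots, V\}$,

$p_1(m) := P\left( \gamma(\widehat{s}_m) - \gamma(s_m) \right)$        $p_2(m) := P_n\left( \gamma(s_m) - \gamma(\widehat{s}_m) \right)$

$p_1^{(-j)}(m) := P\left( \gamma(\widehat{s}_m^{(-j)}) - \gamma(s_m) \right)$        $p_2^{(-j)}(m) := P_n^{(-j)}\left( \gamma(s_m) - \gamma(\widehat{s}_m^{(-j)}) \right)$

$\overline{\delta}(m) := (P_n - P)\left( \gamma(s_m) - \gamma(s) \right)$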
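• Histogram-specific notations: for any random variable Z, q > 0, $m \in \mathcal{M}_n$ and
$\lambda \in \Lambda_m$:

$\mathbb{E}^{\Lambda_m}[Z] := \mathbb{E}\left[ Z \mid (\mathbf{1}_{X_i \in I_\lambda})_{1 \leq i \leq n, \, \lambda \in \Lambda_m} \right]$        $\|Z\|_q^{(\Lambda_m)} := \left( \mathbb{E}^{\Lambda_m}\left[ |Z|^q \right] \right)^{1/q}$

$S_{\lambda,1} := \sum_{X_i \in I_\lambda} (Y_i - \beta_\lambda)$ and $S_{\lambda,2} := \sum_{X_i \in I_\lambda} (Y_i - \beta_\lambda)^2$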

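• Conventions for $p_1$ and $p_2$ when $\widehat{s}_m$ is not well-defined (in the histogram framework):

(19)    $\widetilde{p}_1(m) = p_1^{(0)}(m) + \sum_{\lambda \in \Lambda_m} p_\lambda (\sigma_\lambda)^2 \mathbf{1}_{n\widehat{p}_\lambda = 0}$ with $p_1^{(0)}(m) = \sum_{\lambda \in \Lambda_m, \, \widehat{p}_\lambda > 0} \frac{p_\lambda S_{\lambda,1}^2}{(n\widehat{p}_\lambda)^2}$

        $\widetilde{p}_2(m) := p_2(m) + \frac{1}{n} \sum_{\lambda \in \Lambda_m, \, n\widehat{p}_\lambda = 0} (\sigma_\lambda)^2$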
Notice that whatever the convention we choose (and even if we keep their original definition),
$\widetilde{p}_1$ and $\widetilde{p}_2$ have the same value when $\widehat{s}_m$ is uniquely
defined, and we will always remove from $\mathcal{M}_n$ the other models. The choice we make here
is only important when writing expectations, so it is merely technical. In the following, we will
often write simply $p_1$ (resp. $p_2$) instead of $\widetilde{p}_1$ (resp. $\widetilde{p}_2$).

B.2. Proof of Thm. 1. The idea of the proof is to show that
$\mathrm{crit}_1(m) = \ell(s, \widehat{s}_m)$ and
$\mathrm{crit}_2(m) = \mathrm{crit}_{\mathrm{VFCV}}(m) - c$ (for some random quantity c independent
from m) satisfy the assumptions of Lemma 6 below, on an event of large probability. To this aim,
we will use Prop. 1 as well as the concentration inequalities of Sect. B.5.

First, we have to be more precise about what we do with models m such that sin is not well 
defined for at least one j G {1,...,^}. Denote En{m) this event. By (I56p in Lemma [121 En{i^) 
has a probability smaller than as soon as Dm < Ln(ln(n))~^, so that all the reasonable 
conventions will have the same effect. For the sake of simplicity, we choose in this proof is 
to eliminate such models from Mn- Notice that this removes automatically models such that 
min;vgAm {'^Pa} ^ 1) in particular all models of dimension strictly larger than n/2. 
Denote c = V'"^ Y^=i Pn \ (*)• Then, for every m G M.n-, 



V 

V 



1 ^ 

(20) crit2(m) := critvFCv("i) - c = l{s,Sm) + 7? X! {p'l^H'm) +'5^^\'m)^ + ool£;^(„) . 

i=i 

First, notice that for every j, conditionally to {Xi,Yi)^^^., Sm^'^ is deterministic. In addition, 

\\Y\\^ < A:=l + a \\e\\^ < 00 by assumption. So, Lemma [101 can be applied with t = Sm and 
n changed into Card(i?j) > Ln/V. More precisely, for every m G such that En{m) does 
not hold, for every j G {1, . . . taking x = 41n(n) and rj = ln(n)^^, there is an event of 

probability 1 — Ln~'^ on which 



(21) 5''^\m) 



ln(n) n 



A union bound shows that these inequalities hold uniformly over j and m on an event of prob- 
ability at least 1 — Ln^"^ . Combined with ()20p . this gives 

LVA^ ln(n)2 



1 ^ 

l{s,Sm) + -Y,p'-^'\m) 



n 



(22) crit2(m) > (l - ln(n)"^ 

and a similar upper bound. 

A second key remark is that for every j, p\ has the distribution of pi with a sample size 
n — Card(i?j) instead of n. We can then apply Prop. [9] (with 7 = 4) to get that on an event of 
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probability 1 — Ln ^, for every j G {1, . . . ,V} and m G M.n such that En{m) does not hold, 



(23) 




^(m) < E 








E 


P2 ^ 




(24) 


(-J 
Pi 


\m) > E 




- 


"ln(n)2D„i/2 + g-LnD,„i- 


E 


P2 


(m) 





(25) pS ^'^(m) > (Lln(n)-^ - La,^^^?^. 



m 



E 



P2 ■'^(W') 



Finally, since s(x) = x, X is uniform and the models are regular histograms on $\mathcal{X} = [0, 1]$,
we can compute exactly for each model the bias and the variance term (when the sample size is n):

(26)    $\ell(s, s_m) = \frac{1}{12 D_m^2}$ and $\mathbb{E}[p_2(m)] = \frac{\sigma^2 D_m}{n} + \frac{1}{12 D_m n}$ .



We now explain how this can be used to check the assumptions of Lemma [H Let ci and ki 
be positive constants to be chosen later. 

Small models. First, assume that Dm < ln{n)'^^ . Combining (f22]l . (pll) . (i26l) and using that 
E ^\m) > 0, crit2(m') is roughly of the order of the bias term. Hence, condition (j29p holds 
with C3 = L and K3 = 2ki when n > La,ct,v,ki - Notice that this holds for every ki > 0. 

Intermediate models. We now consider models of dimension In(n)'^^ < Dm < cin{ln{n))^^ . As 
already noticed, En{m) does not hold true for any of them, with a large probability. 

From (I22|) (and the similar upper bound), ([23]) and ([26]), it follows that condition ([28]) 

holds with a = 1/12, b = tr^, C = V/{V — 1), C2 = LA,v,a and K2 = 1, as soon as n > LA,a,v, 
ci < L and ki > 6. Very similar (and somehow simpler) arguments prove that the condition 
P7p holds with the same parameters. 

Large models. Finally, let m G be such that Dm > cin(ln(n))^^. Combining ()22p . (1250 and 
(|26p . crit2(m.) is roughly of the order of the variance term LE when n > LA,a,v,ci- 



As a result, condition (jSOp holds with C4 = LciO" and K4 = 2, for n > LA,a,v, 



Cl- 



Choosing now ci < L and ki = 6, the conclusion directly follows from Lemma [6] below. Notice 
that we have assumed several times that n > no = LA,a,v- These conditions can be dropped by 
choosing Ki > Hq. □ 

Lemma 6. Let a,b,{ci)^^,-^^,{Ki)^^-^^,Crich > and C > 1 be some constants, n £ N 
and Ain 0, set of indexes. Assume that for every m G M.n, Dm £ [^i^], and moreover that 
Vx G [l,n — Crich]; 3m G Mn such that Dm G + Crich]- Let criti and crit2 be some functions 
Mn ^ satisfying the following conditions: 
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(i) for every m G Mn, 

(27) criti(m)=(^-^ + ^)(l + 6i,^ 

(28) crit2(m)=f-^ + ^^Vl+e2, 



OTi/i maxi=i,2 sup^g_^^ ln(„)Ki<^„<^ |ei,m| < C2ln(n) "2. 
fiij /or e?;er?/ m G swc/i t/iai Dm < In(n)''^, 

(29) crit2(m) > C3 (ln(n))"''=* . 
(in) for every m £ Adn such that Dm > if^; 

(30) crit2(m) > C4(ln(n))~'''* . 



Then, there is some constant K{C) = 2^/'^ x 3 ^ (^C — 1^^ >0 and some uq > 

(depending on a, b, (cj)^<j<4, ('«i)i<j<4; ^rich o.^'d C) such that, if n > uq, for every fh G 
argmiiimg^,^ crit2(?n-), 

(31) criti(m) > fl + K(C)-ln(n)-'^2/5\ -^f {criti(m)} . 

SKETCH OF THE PROOF OF Lemma 6. We skip this proof, which is only technical. The main
arguments are the following. First, there is a model $m_1$ of dimension close to
$(2an)^{1/3} b^{-1/3}$, so that $\mathrm{crit}_1(m_1)$ is close to
$3 \times 2^{-2/3} a^{1/3} b^{2/3} n^{-2/3}$. Second, any model $\widehat{m}$ which minimizes
$\mathrm{crit}_2(m)$ must have a dimension close to $(2an)^{1/3} (bC)^{-1/3}$. This implies that
$\mathrm{crit}_1(\widehat{m})$ is larger than
$\left( 1 + K(C) - \ln(n)^{-\kappa_2/5} \right) \mathrm{crit}_1(m_1)$, and the result follows. □
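
The mechanism behind Lemma 6 can be illustrated numerically. The sketch below assumes the idealized forms crit1(D) = a/D^2 + bD/n and crit2(D) = a/D^2 + CbD/n, which is how we read (27) and (28) once the remainder terms are dropped, with a = 1/12, b = sigma^2 and C = V/(V - 1) as in the proof of Thm. 1. The helper names and the closed-form ratio (C^{2/3} + 2C^{-1/3})/3 are ours: the latter comes from a back-of-the-envelope computation with a continuous dimension, and is not claimed to be the constant 1 + K(C) of the lemma.

```python
def crit1(D: int, n: int, a: float, b: float) -> float:
    """Idealized risk: bias term a / D^2 plus variance term b * D / n."""
    return a / D**2 + b * D / n


def crit2(D: int, n: int, a: float, b: float, C: float) -> float:
    """Same criterion with the variance term inflated by the factor C > 1."""
    return a / D**2 + C * b * D / n


def selection_ratio(n: int, a: float, b: float, C: float) -> tuple[int, float]:
    """Dimension minimizing crit2, and crit1 at that dimension over inf crit1."""
    dims = range(1, n + 1)
    d_hat = min(dims, key=lambda D: crit2(D, n, a, b, C))
    best = min(crit1(D, n, a, b) for D in dims)
    return d_hat, crit1(d_hat, n, a, b) / best


def continuous_ratio(C: float) -> float:
    """Ratio obtained when D is treated as a continuous variable."""
    return (C ** (2.0 / 3.0) + 2.0 * C ** (-1.0 / 3.0)) / 3.0


if __name__ == "__main__":
    n, a, b = 100_000, 1.0 / 12.0, 1.0
    for V in (2, 5, 10):
        C = V / (V - 1)
        d_hat, ratio = selection_ratio(n, a, b, C)
        print(f"V={V}: D_hat={d_hat}, ratio={ratio:.4f}, "
              f"continuous approximation={continuous_ratio(C):.4f}")
```

In this idealized setting the ratio stays bounded away from 1 for every fixed V, and gets closer to 1 as V grows, which is the phenomenon quantified by (31).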

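B.3. Proof of Thm. 2. In this section, $L_{(pVF)}$ denotes a constant that depends only
on the set of assumptions of Thm. 2, including V. For every $m \in \mathcal{M}_n$, define
$\mathrm{pen}'_{\mathrm{id}}(m) = p_1(m) + p_2(m) - \overline{\delta}(m) = \mathrm{pen}_{\mathrm{id}}(m) + (P_n - P)\gamma(s)$.
Then, by definition of $\mathrm{pen}_{\mathrm{id}}$ and $\widehat{m}$, we have for every
$m \in \mathcal{M}_n$,

(32)    $\ell(s, \widehat{s}_{\widehat{m}}) - \left( \mathrm{pen}'_{\mathrm{id}}(\widehat{m}) - \mathrm{pen}(\widehat{m}) \right) \leq \ell(s, \widehat{s}_m) + \left( \mathrm{pen}(m) - \mathrm{pen}'_{\mathrm{id}}(m) \right)$ .

The idea of the proof is to show that $\mathrm{pen} - \mathrm{pen}'_{\mathrm{id}}$ is negligible
with respect to $\ell(s, \widehat{s}_m)$ for "reasonable" models (i.e., those which are likely to
be either selected by penVF, or an oracle model) with a large probability. We will prove it by
using Prop. 1 and 2, as well as the concentration inequalities of Sect. B.5.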

For every $m \in \mathcal{M}_n$, define $A_n(m) = \min_{\lambda \in \Lambda_m} \{ n\widehat{p}_\lambda \}$
and $B_n(m) = \min_{\lambda \in \Lambda_m} \{ np_\lambda \}$. We now define the event $\Omega_n$ on
which the concentration inequalities of Prop. 9 and 11 and Lemma 10 and 12 hold with
$\gamma = \alpha_{\mathcal{M}} + 2$ (or similarly $x = (\alpha_{\mathcal{M}} + 2) \ln(n)$), for
every $m \in \mathcal{M}_n$. Using assumption (P1), the union bound gives
$\mathbb{P}(\Omega_n) \geq 1 - L c_{\mathcal{M}} n^{-2}$.

First, let c > be a constant to be chosen later, and consider A4„, the set of models m E Ain 
such that ln(n)^ < Dm < cn(ln(n))^^. According to (Ar^^), this implies Bn{m) > c^£C~^ln(n), 

so that ([56]) ensures that A„(m) > ln(n) if c < L x . In particular, m G ^An on Qn- Now, 

using both bounds on Dm, by construction of fi^, 

I |pi("i) - E [pi{m)] \ , |p2(m') - IE [p2("i)]| , 6{m) 



max 



pen(mj 



[pen 



M]|} 



is smaller than L(pVF) ln(n) ^ (/(s, Sm) + E [^2(^1^) ] ) on this event, at least if c < L x (to ensure 
that Bn{m) is large enough). We now fix c = L^x „ that satisfies those two conditions. Using 
Prop. [21 Lemma[7]and the lower bound on Bn{m), we h.a,ve for every tti G 



-L 



(pVF) 



ln(n)V4 



Ks,Sm) < (pen-pen(d)(m) < 



2(r/-l) + 



L 



(pVF) 

ln(n)V4 



as soon as n > L(pVF) (this restriction is necessary because the bounds are in terms of excess loss 
of Sm instead of /(s, Sm) +E [p2 ])• Combined with ([32|) . this gives: if n > i(pVF) ^i^d c < L^x^ , 



(33) 



< 



277-1 + 



ln(n)V4 



X inX {l{s,Sm)} 



Second, we prove that any minimizer m of crit belongs to Mn on the event Qn- Define, for 
every m G A4„, crit'(m) = crit(m) — P„7 (s), which has the same minimizers over Ain as crit. 
According to (P2), there exists mo G M^n such that ^/n < Dm^ < c^ichV^- If 7^ > -^(pVF)) 
niQ G A^n, from which we deduce (using (Ap)) 



(34) 



crit' (mo) <l{s,Smo) + 



6{mo) + pen(mo) < i(pVF) ( 



n 



n 



-1/2 



On the other hand, if Dm < ln(n)^, we have 

(35) crit'(m) > l{s,Sm) - 6{m) - P2{m) > (ln(n) )~^'^^ - L^^ 



'ln(n) In(n)''' 
■^(pVF)- 



n 



n 



on In addition, if Dm > cn (In (n))""*^ and m G A^n, by Prop. [21 E^"" [pen(m) — P2{n^)] > 
jgAm [p2{m)]. As a consequence, by construction of Qn, we have pen(m) — P2{m) > (1 ~ 
L(pVF)ra~-^/^)E [p2("T') ] on it, so that 



(36) 



crit'(m) > pen(m) — P2{'m) — 6{m) > i(pVF) ln(n)' 
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when n > i(pVF)- Comparing (p4|) . ([35]) and (pHI) . it follows that m G on provided that 
n > -Z^(pVF)- 



Finally, we show that the infimum can be extended to A4n in the right-hand side of (j33p . with 
the convention l{s, Sm) = +00 if j4„(m) = 0. Using similar arguments as above (as well as the defi- 
nition of fi^, in particular (j35|) for large models), we have ^(s, Smo) ^ -^(pVF) (^^"^2/2 _|_ j^-i/2~j 
0„. On the other hand, for every m G A4„, if -Dm < ln(n)^, l{s, Sm) > ^m) > ^(pVF) ln(n)~^^i 
while if Dm > cn(ln(n))~-^, l{s,Sm) > L(^p-yp^ln{n)~'^ on r2„ as soon as n > i^(pVF)- Hence, if 
n > -Z^(pVF)) 110 model m ^ can contribute to the infimum in the right-hand side of ([33]). 



To conclude the proof of (fT3]) . we notice that -^^(pVF) ln(n) -"^/^ < = ln(n) -"^/^ if n > L(pVF)- 
All the conditions of the kind n > no can finally be removed by enlarging Ki so that KiUq^ > 1. 
The final remark concerning 6^ holds true because we can replace the threshold dimensions 
ln(n)^ and cn(ln(n))~^ for "small" and "large" models by some powers of n, as soon as the 
exponents are not taken too far from (resp. 1). 

We now get the more classical oracle inequality (fT3|) by noticing that l{s,Sm) < ^ a.s., so 
that 



< 



2r] — 1 + ln(n 



1-1/5 



E 



inf {l{s,Sm)} 

meMn 



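B.4. Expectations.
B.4.1. Proof of Prop. 1.

Ideal criterion. We have to compute
$\mathbb{E}\left[ P\gamma(\widehat{s}_m) - P\gamma(s_m) \right] = \mathbb{E}[p_1(m)]$. Assume that
$\widehat{s}_m$ is well-defined, i.e. $\min_{\lambda \in \Lambda_m} \widehat{p}_\lambda > 0$. Using
that $s_m$ minimizes $P\gamma(t)$ over $t \in S_m$, we have

(37)    $p_1(m) = \sum_{\lambda \in \Lambda_m} p_\lambda (\beta_\lambda - \widehat{\beta}_\lambda)^2 = \sum_{\lambda \in \Lambda_m} \frac{p_\lambda S_{\lambda,1}^2}{(n\widehat{p}_\lambda)^2}$ so that $\mathbb{E}^{\Lambda_m}[p_1(m)] = \frac{1}{n} \sum_{\lambda \in \Lambda_m} \frac{p_\lambda}{\widehat{p}_\lambda} (\sigma_\lambda)^2$ .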

The result dSD follows, with Sn,p^ = - 1 if Pi =Pi^°\ or 6n,p^ = - l + npA(l-?'A)" 

if pi = pi. In each case, the proof of Lemma [3] gives non-asymptotic bounds on 5n,px- 

V-fold criterion. By definition ([1]), on the event on which Sm^^ is well-defined for every j, 

V 

V 



1 ^ 

critvFCvM = - ^ [p^^)(m) + - P) 7 (4-^)) + P7 



The second term is centered conditionally to (Xj,yi)j^^^, so that we only have to compute 
for every j. Since {Xi,Yi)i^B. is an i.i.d. sample of size n — Card(i?j), we can apply 



E 






the above computation of E [pi]. Using a convention similar to pi^^^ (which can be used on real 
data, since it does not depend on P), the result ^ holds with 



1 ^ 



n — n/V 
n — Card(5j 



-Card(i?,),p;,) 



1 + 



1 



Card(Bj) 

n - Cavd{Bj) ~ V -1 



Prom Lemma [HI we deduce that if n ^ maxj Card(i?j) < cb < 1, then 



I - CB '^^ 



L 



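Similarly to the computation of $p_1(m)$, when $\min_{\lambda \in \Lambda_m} \widehat{p}_\lambda > 0$, we have

(38)    $p_2(m) = \sum_{\lambda \in \Lambda_m} \widehat{p}_\lambda (\widehat{\beta}_\lambda - \beta_\lambda)^2 = \sum_{\lambda \in \Lambda_m, \, n\widehat{p}_\lambda > 0} \frac{S_{\lambda,1}^2}{n^2 \widehat{p}_\lambda}$ so that $\mathbb{E}^{\Lambda_m}[p_2(m)] = \frac{1}{n} \sum_{\lambda \in \Lambda_m} (\sigma_\lambda)^2$ .

Notice that $\mathbb{E}^{\Lambda_m}[p_2(m)] = \mathbb{E}[p_2(m)]$ on this event. Using Lemma 3, this proves the following.

Lemma 7. If $\min_{\lambda \in \Lambda_m} \{ np_\lambda \} \geq B \geq 1$,

(39)    $(1 - e^{-B}) \, \mathbb{E}[p_2(m)] \leq \mathbb{E}\left[ p_1^{(0)}(m) \right] \leq \mathbb{E}\left[ \widetilde{p}_1(m) \right] \leq \left( 1 + \sup_{np \geq B} \delta_{n,p} \right) \mathbb{E}[p_2(m)]$

where $\delta_{n,p}$ comes from Prop. 1. A similar result holds with $\widetilde{p}_2$ instead of $p_2$ inside the expectation.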

B.4.2. Proof of Prop. [H First of all, notice that all this proof is made conditionally to 
i^x,eix)i<i<n AeA • '^^^ outline of the proof is to prove that E^'" [peuyp] can be derived 
from the case where W satisfies an exchangeability condition, for which we can use Lemma [8] 
below. This is why we consider more generally the penalty peny(/ ^m, {Xi,Yi )i<j<„) , defined by 
(jlip for a general weight vector W G M", strengthening its dependence on the distribution of W 
and the data. When W is the subsampling weight vector of interest, pen^^/ coincides with the 
definition of peuyp in Algorithm [2j 

Let (T be a random permutation of { 1, . . . , n}, independent from W and the data, and uniform 
over the permutations that leave (IXiG/A )i<i<n agA^ invariant. Defining W = ^^(i) ^ , 



E^ 



pen- (m, iXi,Yi] 



l<i<n 



E^ 



E' 



pen^y (^m,(x,-i(,),y,-i(,)^^^.^^ 



since the penalty does not depend on the order of {Wi, Xi,Yi)x^£ix (for the first equality), and 
{Xi,Yi)x^£ix is exchangeable (for the second equality). Moreover, for every A G A^, {Wi)xi£ix 
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is exchangeable and independent from {Xi,Yi)xi(^i),- We can thus use Lemma [8] to compute 
pen^{m). Then, 

C 

E^'-ipenM] = - J2 (^i,i^('^>PA) + i?2^^^.(n,PA)) (^a)' . 
AeAm 



It now remains to compute ^ and m>' ^ divides npx, then W\ = 1 a.s. and 



w 



R. 



%w 



iy — 1) ^ . For the general case, see the proof of Prop. 5.2 in |Arl07j (Sect. 5.7.2). □ 



Lemma 8 (Lemma 5.7 of [Arl07j ) . Let Sm be the model of histograms adapted to some 
partition {I\)x^\^, W G [0; oo)" be a random vector such that for every A E A^, (Wi)x^£ix 
exchangeable and independent from {Xi,Yi)xieix - Define the Resampling Penalty for histograms 
as (fTTIl . and assume min^eA^ {np\} > 2. Then, 



n 



C npxSx 2 — 
(40) pen(m) = — > {Ri,w{n,Px) + R2,wi'^^Px)) ^ ; — - 

r, ^-^ fipx — 1 



AeA„ 



(41) Ri,w{n,px) 



jw^, - WxY 
wl 



VFa > 



where 

{W,, - Wx? 
Wx 



and ix is any index such that Xi^ G Ix ■ 

B.5. Concentration results. In order to prove Thm. 1 and 2, we need to combine Prop. 1
and 2 with concentration inequalities, which are the purpose of the present section. Let $S_m$ be
the model of histograms associated with some partition $(I_\lambda)_{\lambda \in \Lambda_m}$, and
assume that both (Ab) and (An) are satisfied (see the statement of Thm. 2).

Our first result deals with $p_1$ and $p_2$, which are the main components of the ideal penalty.
Whereas concentration for $p_2$ can be obtained in a general framework (see [Arl07], Chap. 7),
lower bounds on $p_1$ are completely new, to the best of our knowledge.

Proposition 9. Let 7 > and assume that minAgA™, {^Pa} ^ Bn- Then, if Bn > 1, on an 
event of probability at least 1 — Ln~'^ , 

(42) pl(m) > E [pl(m)] - La,^^,^,^ [ln(n)2D„V2 + g-LBn] e [p^im)] 

(43) plim) < E [plim] ] + La,^^,^,^ [ln{nfD-,^^^ + ^^e"^^" ] E [p2(m) ] 

(44) \p2{m)-E[p2{m)]\<LA,.^,^,jD-y^ln{n)E[p2{m)] . 

In addition, if Bn > 0, there is an event of probability at least 1 — Ln~^ on which 

1 



(45) 



Pi{m) > 



2 + (7 + l)Bn^ ln(n) 



LA,a,^,^,jHnfD;;^y^ ]E[p2{m)] 
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PROOF OF Prop. [91 According to the explicit expressions (I37p and (j38p . pi{m) and P2{m) 
are both U-statistics of order 2 conditionahy to {lx^ei\){i,\)- Then, we use Lemma O with 
6,A = Yi- Px, ax = 0,bx= pxinpx)''^ for p{ and 6a = {n'^Pxy^ for P2- This proves, for all q>2, 



(46) 
(47) 



pi(m) - E^'" [pi(m)] 



(Am) 



< max 
AeAm 



pa- 
pa" 



Pa>0 



||p2(m) -E[p2(m)]||f-) < LA,.„,„I);.'/'<zIE[P2(m)] 



We deduce conditional concentration inequalities from those moment inequalities (for instance 
by Lemma 8.9 of |Arl07j ) . with a deterministic probability bound 1 — Le~^ = 1 — n~'^. Hence, 
we deduce unconditional concentration inequalities, and the result follows for p2- To control the 
remainder term for pi, we use [Ml in Lemma [T2l 

We now have to control the distance between E^™ [pi] and E [pi]. First, if Bn > 1, we can 
use LemmaHl taking Xx = npx and ax = Px (^a)^; according to ([37]). we have p{{m) = Zm,i and 
the concentration inequality for p{ follows. On the other hand, if we only know that Bn > 0, 
instead of using Lemma [H we remark that 



E"^™ [pi{m)] > min 
AeAm 



Px 

PX 



E^™ [P2im)] 



and the result follows thanks to (j55|) in Lemma [T2l 



□ 



We mention here a classical result, which is a consequence of Bernstein's inequality,
since it deals with sums of independent variables. We refer to [AM08] for a detailed proof.

Lemma 10 (Prop. 3, [AM08]). Let t be any deterministic predictor. For every $x \geq 0$, there
is an event of probability at least $1 - 2e^{-x}$ on which

(48)    $\forall \eta > 0, \quad \left| (P - P_n)\left( \gamma(t) - \gamma(s) \right) \right| \leq \eta \, \ell(s, t) + \left( \frac{4}{\eta} + \frac{8}{3} \right) \frac{A^2 x}{n}$ .

Finally, we consider the V-fold penalties defined by Algorithm 2.

Proposition 11. Let pen(m) be defined by (fTOl) with the weights W defined in Algorithm\M 
and 7 > 0. There is an event of probability at least 1 — on which, i/min^gAmPA > 0, 



(49) pen(m) - E^" [pen(m) ] 



< C 



miuAeA^ { npx } V 
PROOF OF Prop. [HI By definition ([101), pen(m) = E^y [Z] with 



(50) Z=Y. {px+pY){h-A 



AeA„ 



E 

AeA,„ 



l + Wx 
n'^PxW? 



E iWx-Wi)iYi-px: 
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For every q > 1, using Jensen inequality and the independence between W and the data 
(conditionally to (Ix^eh )i a)' 



pen(m) — E^™ [pen(m)] 



(Am) 



< 



Z-K^"^ [Z\ W] 



(51) 



< sup 

Wo&upp{W) 



(Am) 

Q 



Z W = W( 



1 



where supp(Ty) is the support of the resampling weight vector W distribution (conditionally to 

(lx,e/A )j a) ^^'^ IMIg^"'^™'' denotes the q-th moment conditionally to {Ix.eix )(j a) ^ ~ ^o- 
In other words, the deviations of pen are smaller than those of the worse case with a deterministic 
weight vector Wq £ supp(VF). 

Prom now on, we work conditionally to ( Ix^g/a )(i a) assume that W £ M" is deterministic, 
among those authorized by Algorithm [2j Denote by -^(i,a); ■ ■ ■ > -^{npx A) data such that Xi G 
Ix- According to ([50]) . Lemma [5] with rx = npx, ax = 0, bx = {1 + M^a)(^^Pa^a) ""^ 
= (^(i,A) - ^a) {Y{i,x) - Px) shows that 



Z-K- 



A„ 



(WAm) ^ LA^q 
q ~ n 



\ 



AeA™ 



l + Wx 
npxW^ 



E(^M)-W^A 



A i=l 



We now fix some A E A^ and write npx = aV + 6 > 1 with a, 6 € N and < 6 < y — 1. Since 
W is in the support of the F-fold weights distribution of Algorithm [H there is an e G {0, 1} 
such that 

{ Wi s.t. ATj G 1;^ } = I repeated a + e times, — — ^ repeated rx — a — e times | . 



Hence, 



1 + 



h-Ve 



{V -l){aV + h) 
so that for every q > 2, 

/ \ TT-i A r /NT i^m) 

pen(m) — Hi " [pen(mjj 



and 

i=l 



) <Lx 


npx ' 







< LA^q 



1 

V — 



miuAeA,^ {npx} V 



The classical link between moment and concentration inequalities {e.g. Lemma 8.9 in jArlOTj ) 
gives ()i9]) conditionally to {\xi&ix)i x- remove this conditioning since the probability 

bound 1 — n~'^ is deterministic. □ 
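B.6. Expectation of inverses of binomials (proof of Lemma 3). Let $Z \sim \mathcal{B}(n,p)$. By
Jensen's inequality,

$e^+_{\mathcal{B}(n,p)} \geq \mathbb{P}(Z > 0) = 1 - (1 - p)^n \geq 1 - e^{-np}$ .

For the upper bound, define

(52)    $e^0_{\mathcal{L}(Z)} := \mathbb{E}[Z] \, \mathbb{E}\left[ Z^{-1} \mathbf{1}_{Z>0} \right] = e^+_{\mathcal{L}(Z)} \, \mathbb{P}(Z > 0)$ ,

so that we can focus on $e^0_{\mathcal{B}(n,p)}$: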



(53) 



The bound by $\kappa_4$ follows from Lemma 4.1 of [GKKW02], according to which

$\forall n \in \mathbb{N}, \ \forall p \in (0, 1], \quad e^0_{\mathcal{B}(n,p)} \leq \frac{2np}{(n+1)p} < 2$ .

We can now assume that $np \geq A \geq 29.17$ since otherwise, $1 + \kappa_3 (np)^{-1/4} \geq \kappa_4$.
Using that $\mathbb{P}(1 > Z > 0) = 0$, we have for every a > 0,



°B(n,p) 



npE, 



aE[Z]>Z>0 



+ npE \z-Hz>aE[z]] < npF {anp > Z >0) + . 



We now bound the probability on the right-hand side thanks to Bernstein's inequality {e.g. 
Prop. 2.9 of |Masn70 : 



ye > 0, 



z < ( 1 -V29- 



np \ < e 



-8np 



and 9 = A Straightforward computations shows that 



sup {e^(„,p)} < 

np>A ^ ' 



1 - V2A-V4 _ 1^-1/2 



Ae 



1 - e- 



from which the result follows. 



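B.7. A technical lemma. Because of the randomness of the design, we have to ensure
that the empirical frequencies $n\widehat{p}_\lambda$ are not too far from the expected ones $np_\lambda$.

Lemma 12. Let $(p_\lambda)_{\lambda \in \Lambda_m}$ be non-negative real numbers of sum 1,
$(n\widehat{p}_\lambda)_{\lambda \in \Lambda_m}$ a multinomial vector of parameters
$(n; (p_\lambda)_{\lambda \in \Lambda_m})$, and $\gamma > 0$. Assume that
$\mathrm{Card}(\Lambda_m) \leq n$ and $\min_{\lambda \in \Lambda_m} \{ np_\lambda \} \geq B_n > 0$.
There is an event of probability at least $1 - Ln^{-\gamma}$ on which the following three
inequalities hold.

(54)    $\max_{\lambda \in \Lambda_m, \, \widehat{p}_\lambda > 0} \left\{ \frac{p_\lambda}{\widehat{p}_\lambda} \right\} \leq L \times (\gamma + 1) \ln(n)$

(55)    $\min_{\lambda \in \Lambda_m} \left\{ \frac{p_\lambda}{\widehat{p}_\lambda} \right\} \geq \frac{1}{2 + (\gamma + 1) B_n^{-1} \ln(n)}$

(56)    $\min_{\lambda \in \Lambda_m} \{ n\widehat{p}_\lambda \} \geq \frac{\min_{\lambda \in \Lambda_m} \{ np_\lambda \}}{2} - 2(\gamma + 1) \ln(n)$

PROOF OF Lemma 12. Those three results come from Bernstein's inequality (e.g. Prop. 2.9
of [Mas07]) applied to $n\widehat{p}_\lambda$: for every $\lambda \in \Lambda_m$, there is a set of
probability $1 - 2n^{-(\gamma+1)}$ on which

$np_\lambda - \sqrt{2 np_\lambda (\gamma + 1) \ln(n)} - \frac{(\gamma + 1) \ln(n)}{3} \leq n\widehat{p}_\lambda \leq np_\lambda + \sqrt{2 np_\lambda (\gamma + 1) \ln(n)} + \frac{(\gamma + 1) \ln(n)}{3}$ .

For (54), if $np_\lambda \geq 8(\gamma + 1) \ln(n)$, the lower bound gives the result. Otherwise,
remark only that $(p_\lambda / \widehat{p}_\lambda) \mathbf{1}_{\widehat{p}_\lambda > 0} \leq np_\lambda \leq 8(\gamma + 1) \ln(n)$.
For (55), use the upper bound and remark that
$np_\lambda (\gamma + 1) \ln(n) B_n^{-1} \geq (\gamma + 1) \ln(n)$. For (56), use the lower bound
and remark that $\sqrt{2 np_\lambda (\gamma + 1) \ln(n)} \leq (np_\lambda)/2 + (\gamma + 1) \ln(n)$.
Finally, the union bound gives the result since $\mathrm{Card}(\Lambda_m) \leq n$. □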
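
The Bernstein step of this proof can be checked exactly on a single cell, since $n\widehat{p}_\lambda$ is binomial. The sketch below assumes the usual form of Bernstein's inequality for a sum of n independent Bernoulli(p) variables, with deviation $\sqrt{2npx} + x/3$ and $x = (\gamma + 1)\ln(n)$; the /3 term is the standard one and is an assumption about the garbled display. Function names and parameter values are ours.

```python
import math


def bernstein_interval(n: int, p: float, gamma: float) -> tuple[float, float]:
    """Two-sided Bernstein interval for a Binomial(n, p) count.

    Assumption of this sketch: deviation sqrt(2 n p x) + x / 3 with
    x = (gamma + 1) ln(n), each side failing with probability at most e^{-x}.
    """
    x = (gamma + 1.0) * math.log(n)
    half_width = math.sqrt(2.0 * n * p * x) + x / 3.0
    return n * p - half_width, n * p + half_width


def binomial_coverage(n: int, p: float, low: float, high: float) -> float:
    """Exact probability that a Binomial(n, p) count falls in [low, high]."""
    return sum(
        math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        for k in range(n + 1)
        if low <= k <= high
    )


if __name__ == "__main__":
    n, p, gamma = 100, 0.3, 1.0
    low, high = bernstein_interval(n, p, gamma)
    coverage = binomial_coverage(n, p, low, high)
    guarantee = 1.0 - 2.0 * n ** -(gamma + 1.0)
    print(f"interval = [{low:.2f}, {high:.2f}]")
    print(f"exact coverage = {coverage:.10f}, guaranteed >= {guarantee:.4f}")
```

The exact coverage is typically much larger than the guaranteed level, which reflects the slack of Bernstein's inequality rather than any property specific to the lemma.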
This is a technical appendix to "V-fold cross-validation improved: V-fold penalization". We
present some additional simulation experiments, a few remarks about expectations of inverses,
and the proofs which have been skipped or shortened in the main paper.

Throughout this appendix, we use the notations of the main paper [Arl08| . In order to distin- 
guish references within the appendix from references to the main paper, we denote the former 
ones by (1) or 1, and the latter ones by (1) or 1. 

Following the ordering of [Arl08], we first present the additional simulation studies mentioned
in Sect. 4. Then, we add a few comments to Appendix A.1. Finally, we give some technical
proofs.

1. Simulation study. We consider in this section eight experiments (called S1000, S√0.1,
S0.1, Svar2, Sqrt, His6, DopReg and Dop2bin) in which we have compared the same procedures
as in Sect. 4, with the same benchmarks, but with only N = 250 samples for each experiment.
Data are generated according to

$Y_i = s(X_i) + \sigma(X_i) \epsilon_i$

with $X_i$ i.i.d. uniform on $\mathcal{X} = [0; 1]$ and $\epsilon_i \sim \mathcal{N}(0, 1)$
independent from $X_i$. The experiments differ in

• the regression function s:

- S1000, S√0.1, S0.1 and Svar2 have the same smooth function as S1 and S2, see Fig. 1.

- Sqrt has $s(x) = \sqrt{x}$, which is smooth except around 0, see Fig. 6.

- His6 has a regular histogram with 5 jumps (hence it belongs to the regular histogram
model of dimension 6), see Fig. 8.

- DopReg and Dop2bin have the Doppler function, as defined by Donoho and Johnstone
[DJ95], see Fig. 10.

• the noise level σ:

- $\sigma(x) = 1$ for S1000, Sqrt, His6, DopReg and Dop2bin.

- $\sigma(x) = \sqrt{0.1}$ for S√0.1.

- $\sigma(x) = 0.1$ for S0.1.

- $\sigma(x) = \mathbf{1}_{x \geq 1/2}$ for Svar2.

• the sample size n:

- n = 200 for S√0.1, S0.1, Svar2, Sqrt and His6.

- n = 1000 for S1000.

- n = 2048 for DopReg and Dop2bin.

• the family of models: with the notations introduced in Sect. 4,

- for S1000, S√0.1, S0.1, Sqrt and His6, we use the "regular" collection, as for S1:

$\mathcal{M}_n = \left\{ 1, \ldots, \left\lfloor \frac{n}{\ln(n)} \right\rfloor \right\}$

- for Svar2, we use the "regular with two bin sizes" collection, as for S2 (see Sect. 4).

- for DopReg, we use the "regular dyadic" collection, as for HSd1:

$\mathcal{M}_n = \left\{ 2^k \text{ s.t. } 0 \leq k \leq \ln_2(n) - 1 \right\}$

- for Dop2bin, we use the "regular dyadic with two bin sizes" collection, as for HSd2:

$\mathcal{M}_n = \{1\} \cup \left\{ 2^k \text{ s.t. } 0 \leq k \leq \ln_2(n) - 2 \right\}^2$



Notice that, contrary to HSd2, Dop2bin is a homoscedastic problem. The interest of considering
two bin sizes for it is that the smoothness of the Doppler function is quite different for small x
and for x > 1/2.
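
The data-generating mechanism described above can be sketched as follows for four of the experiments. This is only an illustration: the smooth regression function is taken to be sin(πx), as listed for S1 and S2 in Table 1, the noise of Svar2 is taken to be switched on for x ≥ 1/2, and the dictionary layout, function names and seed handling are ours. The His6 and Doppler functions are left out because their exact definitions are not restated here.

```python
import math
import random


def s_smooth(x: float) -> float:
    """Smooth regression function, assumed to be sin(pi x) as for S1 and S2."""
    return math.sin(math.pi * x)


# name -> (regression function s, noise level sigma, sample size n)
EXPERIMENTS = {
    "S1000": (s_smooth, lambda x: 1.0, 1000),
    "S0.1": (s_smooth, lambda x: 0.1, 200),
    "Svar2": (s_smooth, lambda x: 1.0 if x >= 0.5 else 0.0, 200),
    "Sqrt": (math.sqrt, lambda x: 1.0, 200),
}


def simulate(name: str, seed: int = 0) -> tuple[list[float], list[float]]:
    """Draw one sample (X_i, Y_i) with Y_i = s(X_i) + sigma(X_i) * eps_i."""
    s, sigma, n = EXPERIMENTS[name]
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]  # X_i uniform on [0, 1)
    ys = [s(x) + sigma(x) * rng.gauss(0.0, 1.0) for x in xs]  # eps_i ~ N(0, 1)
    return xs, ys


if __name__ == "__main__":
    for name in EXPERIMENTS:
        xs, ys = simulate(name, seed=1)
        print(f"{name}: n = {len(xs)}, first point = ({xs[0]:.3f}, {ys[0]:.3f})")
```

Repeating the draw with N = 250 different seeds would mimic the replication scheme of this section, each sample being passed to the model selection procedures under comparison.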

Instances of data sets for each experiment are given in Fig. 2-5, 7, 9 and 11.

Compared to S1, S2, HSd1 and HSd2, these eight experiments consider larger signal-to-noise
ratio data (S1000, S√0.1, S0.1), another kind of heteroscedasticity (Svar2) and other regression
functions, with different kinds of unsmoothness (Sqrt, His6, DopReg and Dop2bin).

We consider for each of these experiments the same algorithms as in Sect. 4, adding to
them Mal*, which is Mallows' $C_p$ penalty with the true value of the variance:
$\mathrm{pen}(m) = 2 \, \mathbb{E}\left[ \sigma^2(X) \right] D_m n^{-1}$. Although it cannot be
used on real data sets, it is an interesting point of comparison, which does not have possible
weaknesses coming from the variance estimator $\widehat{\sigma}^2$.
Our estimates of $C_{\mathrm{or}}$ (and uncertainties for these estimates) for the procedures we
consider are reported in Tab. 1 to 3 (we report here again the results for S1, S2, HSd1 and HSd2
to make comparisons easier). On the last line of these Tables, we also report

$\frac{\mathbb{E}\left[ \inf_{m \in \mathcal{M}_n} \ell(s, \widehat{s}_m) \right]}{\inf_{m \in \mathcal{M}_n} \left\{ \mathbb{E}\left[ \ell(s, \widehat{s}_m) \right] \right\}} = \frac{C'_{\mathrm{or}}}{C_{\mathrm{or}}}$ where $C'_{\mathrm{or}} := \frac{\mathbb{E}\left[ \ell(s, \widehat{s}_{\widehat{m}}) \right]}{\inf_{m \in \mathcal{M}_n} \left\{ \mathbb{E}\left[ \ell(s, \widehat{s}_m) \right] \right\}}$

is the leading constant which appears in most of the classical oracle inequalities. Notice that
$C'_{\mathrm{or}}$ is always smaller than $C_{\mathrm{or}}$.

Fig 4. Data sample for S0.1.

Fig 5. Data sample for Svar2.

Fig 10. s(x) = Doppler(x) (see [DJ95]).

Fig 11. Data sample for DopReg and Dop2bin.
It appears that the choice of V is still difficult for VFCV: V = 2 is optimal in S1000 and Sqrt
and V = 20 in the six other ones. On the contrary, V = n is (almost) always better for penVF
and penVF+, and overpenalization often improves the quality of the algorithm (but not always:
see DopReg and S0.1). These eight experiments mainly show that the assumptions of Thm. 2
are not necessary for penVF to be efficient.

For the sake of completeness, we also reported the results for the twelve experiments in terms
of the other benchmark

$C_{\mathrm{path-or}} := \mathbb{E}\left[ \frac{\ell(s, \widehat{s}_{\widehat{m}})}{\inf_{m \in \mathcal{M}_n} \ell(s, \widehat{s}_m)} \right]$

in Tab. 4 to Tab. 6. They are indeed quite similar to the previous ones.

2. Addendum to Appendix A.1. Whereas Lemma 3 is stated for the particular case of
Binomial variables, it is worth noticing that ingredients of its proof can be successfully used in
order to derive non-asymptotic bounds on $e^+_{\mathcal{L}(Z)}$ or $e^0_{\mathcal{L}(Z)}$ for
several other distributions than the Binomial one. This has for instance been used in Sect. 6.7
of [Arl07] for the Hypergeometric and Poisson cases.

First, the lower bound in (15) comes from Jensen's inequality:

$e^+_{\mathcal{L}(Z)} \geq \mathbb{P}(Z > 0)$ .

Second, taking = 0.16 in the proof of Lemma 3 gives the absolute upper bound 

6° < K4 = 7.8 

instead of the smaller value given by Lemma 4.1 of [GKKW02]. Hence, the proof of Lemma 3 only uses that $\mathbb{P}(0 < Z < c_Z) = 0$ for some $c_Z > 0$ and that Z satisfies a concentration inequality similar to Bernstein's inequality. This covers a wide class of random variables.

Finally, notice that taking 9 = 3ln(A)/A at the end of the proof of Lemma 3, instead of 
6 = j4^^/^, leads to an upper bound 



ln(A) ( , 1 

for some numerical constant K5, showing that the rate A~^/^ is far from optimal. 
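
The Jensen step mentioned above can be checked numerically. The sketch below assumes that the bounded quantity has the form $\mathbb{E}[Z]\,\mathbb{E}[Z^{-1} \mid Z > 0]$ (this definition is an assumption, since it is stated elsewhere in the paper); Jensen's inequality applied to $z \mapsto 1/z$ under the conditional law of Z given Z > 0 then yields the lower bound $\mathbb{P}(Z > 0)$. The function names are hypothetical.

```python
from math import comb

def binomial_pmf(n, p):
    """Exact probability mass function of Binomial(n, p) as a list."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

def e_plus_binomial(n, p):
    """Return (E[Z] * E[1/Z | Z > 0], P(Z > 0)) for Z ~ Binomial(n, p).

    The first component is the ASSUMED form of the quantity bounded in Lemma 3.
    """
    pmf = binomial_pmf(n, p)
    prob_pos = 1.0 - pmf[0]
    mean = n * p
    inv_mean_given_pos = sum(pmf[k] / k for k in range(1, n + 1)) / prob_pos
    return mean * inv_mean_given_pos, prob_pos

# A few (n, p) pairs, from very sparse to well-filled cells.
checks = [e_plus_binomial(n, p) for n in (5, 50, 200) for p in (0.01, 0.1, 0.5)]
```

In every pair of `checks`, the first value should be at least the second one, in line with the Jensen lower bound.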
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Table 1 

Accuracy indexes $C_{or}$ for experiments S1, S2, HSd1 and HSd2 (N = 1000). Uncertainties reported are empirical standard deviations divided by $\sqrt{N}$.

Experiment        S1               S2               HSd1               HSd2
s                 sin(π·)          sin(π·)          HeaviSine          HeaviSine
σ(x)              1                x                1                  x
n (sample size)   200              200              2048               2048
M_n               regular          2 bin sizes      dyadic, regular    dyadic, 2 bin sizes

Mal               1.928 ± 0.04     3.687 ± 0.07     1.015 ± 0.003      1.373 ± 0.010
Mal+              1.800 ± 0.03     3.173 ± 0.07     1.002 ± 0.003      1.411 ± 0.008
Mal*              2.028 ± 0.04     2.657 ± 0.06     1.044 ± 0.004      1.513 ± 0.005
Mal*+             1.827 ± 0.03     2.437 ± 0.05     1.004 ± 0.003      1.548 ± 0.003
E[pen_id]         1.919 ± 0.03     2.296 ± 0.05     1.028 ± 0.004      1.102 ± 0.004
E[pen_id]+        1.792 ± 0.03     2.028 ± 0.04     1.003 ± 0.003      1.089 ± 0.004

2-FCV             2.078 ± 0.04     2.542 ± 0.05     1.002 ± 0.003      1.184 ± 0.004
5-FCV             2.137 ± 0.04     2.582 ± 0.06     1.014 ± 0.003      1.115 ± 0.005
10-FCV            2.097 ± 0.04     2.603 ± 0.06     1.021 ± 0.003      1.109 ± 0.004
20-FCV            2.088 ± 0.04     2.578 ± 0.06     1.029 ± 0.004      1.105 ± 0.004
LOO               2.077 ± 0.04     2.593 ± 0.06     1.034 ± 0.004      1.105 ± 0.004

pen2-F            2.578 ± 0.06     3.061 ± 0.07     1.038 ± 0.004      1.103 ± 0.004
pen5-F            2.219 ± 0.05     2.750 ± 0.06     1.037 ± 0.004      1.104 ± 0.004
pen10-F           2.121 ± 0.04     2.653 ± 0.06     1.034 ± 0.004      1.104 ± 0.004
pen20-F           2.085 ± 0.04     2.639 ± 0.06     1.034 ± 0.004      1.105 ± 0.004
penLoo            2.080 ± 0.01     2.593 ± 0.06     1.031 ± 0.001      1.105 ± 0.001

pen2-F+           2.175 ± 0.05     2.748 ± 0.06     1.011 ± 0.003      1.106 ± 0.004
pen5-F+           1.913 ± 0.03     2.378 ± 0.05     1.006 ± 0.003      1.102 ± 0.004
pen10-F+          1.872 ± 0.03     2.285 ± 0.05     1.005 ± 0.003      1.098 ± 0.004
pen20-F+          1.898 ± 0.03     2.254 ± 0.05     1.004 ± 0.003      1.098 ± 0.004
penLoo+           1.844 ± 0.03     2.215 ± 0.05     1.004 ± 0.003      1.096 ± 0.004

C'_or / C_or      0.768            0.753            0.999              0.854
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Table 2 

Accuracy indexes $C_{or}$ for experiments S1000, S√0.1, S0.1 and Svar2 (N = 250). Uncertainties reported are empirical standard deviations divided by $\sqrt{N}$.

Experiment        S1000            S√0.1            S0.1             Svar2
s                 sin(π·)          sin(π·)          sin(π·)          sin(π·)
σ(x)              1                √0.1             0.1              1_{x>1/2}
n (sample size)   1000             200              200              200
M_n               regular          regular          regular          2 bin sizes

Mal               1.667 ± 0.04     1.611 ± 0.03     1.400 ± 0.02     5.643 ± 0.22
Mal+              1.619 ± 0.03     1.593 ± 0.03     1.426 ± 0.02     4.647 ± 0.22
Mal*              1.745 ± 0.05     1.925 ± 0.03     3.204 ± 0.05     4.481 ± 0.21
Mal*+             1.617 ± 0.03     2.073 ± 0.04     3.641 ± 0.07     3.544 ± 0.17
E[pen_id]         1.745 ± 0.05     1.571 ± 0.03     1.373 ± 0.02     2.409 ± 0.13
E[pen_id]+        1.617 ± 0.03     1.554 ± 0.03     1.392 ± 0.02     2.005 ± 0.10

2-FCV             1.668 ± 0.04     1.663 ± 0.04     1.394 ± 0.02     2.960 ± 0.15
5-FCV             1.756 ± 0.07     1.693 ± 0.04     1.393 ± 0.02     2.950 ± 0.16
10-FCV            1.746 ± 0.04     1.684 ± 0.04     1.385 ± 0.02     2.681 ± 0.14
20-FCV            1.774 ± 0.05     1.645 ± 0.03     1.382 ± 0.02     2.742 ± 0.16
LOO               1.768 ± 0.05     1.639 ± 0.04     1.379 ± 0.02     2.641 ± 0.15

pen2-F            2.066 ± 0.08     1.809 ± 0.05     1.390 ± 0.02     3.209 ± 0.18
pen5-F            1.816 ± 0.05     1.638 ± 0.04     1.400 ± 0.02     2.749 ± 0.15
pen10-F           1.783 ± 0.05     1.706 ± 0.04     1.374 ± 0.02     2.598 ± 0.15
pen20-F           1.801 ± 0.05     1.657 ± 0.03     1.385 ± 0.02     2.684 ± 0.15
penLoo            1.776 ± 0.05     1.641 ± 0.04     1.379 ± 0.02     2.656 ± 0.15

pen2-F+           1.809 ± 0.05     1.714 ± 0.04     1.416 ± 0.02     2.808 ± 0.16
pen5-F+           1.683 ± 0.04     1.616 ± 0.03     1.399 ± 0.02     2.460 ± 0.14
pen10-F+          1.627 ± 0.04     1.613 ± 0.03     1.385 ± 0.02     2.398 ± 0.14
pen20-F+          1.644 ± 0.04     1.583 ± 0.03     1.390 ± 0.02     2.316 ± 0.13
penLoo+           1.626 ± 0.03     1.587 ± 0.03     1.401 ± 0.02     2.349 ± 0.13

C'_or / C_or      0.8              0.801            0.816            0.779
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Table 3 

Accuracy indexes $C_{or}$ for experiments Sqrt, His6, DopReg and Dop2bin (N = 250). Uncertainties reported are empirical standard deviations divided by $\sqrt{N}$.

Experiment        Sqrt             His6             DopReg             Dop2bin
s                 √·               His6             Doppler            Doppler
σ(x)              1                1                1                  1
n (sample size)   200              200              2048               2048
M_n               regular          regular          dyadic, regular    dyadic, 2 bin sizes

Mal               2.295 ± 0.11     1.969 ± 0.11     1.039 ± 0.01       1.052 ± 0.01
Mal+              1.989 ± 0.08     1.799 ± 0.09     1.090 ± 0.00       1.047 ± 0.01
Mal*              2.483 ± 0.12     2.021 ± 0.11     1.013 ± 0.01       1.061 ± 0.01
Mal*+             2.075 ± 0.09     1.836 ± 0.10     1.070 ± 0.00       1.041 ± 0.01
E[pen_id]         2.365 ± 0.11     1.805 ± 0.10     1.025 ± 0.01       1.056 ± 0.01
E[pen_id]+        2.012 ± 0.09     1.632 ± 0.08     1.083 ± 0.00       1.040 ± 0.01

2-FCV             2.489 ± 0.12     2.788 ± 0.13     1.097 ± 0.00       1.165 ± 0.01
5-FCV             2.777 ± 0.16     2.316 ± 0.12     1.064 ± 0.01       1.049 ± 0.01
10-FCV            2.571 ± 0.13     2.074 ± 0.11     1.043 ± 0.01       1.051 ± 0.01
20-FCV            2.561 ± 0.12     2.071 ± 0.11     1.034 ± 0.01       1.053 ± 0.01
LOO               2.695 ± 0.14     2.059 ± 0.11     1.026 ± 0.01       1.058 ± 0.01

pen2-F            4.088 ± 0.23     3.210 ± 0.14     1.048 ± 0.01       1.062 ± 0.01
pen5-F            3.024 ± 0.18     2.485 ± 0.13     1.033 ± 0.01       1.055 ± 0.01
pen10-F           3.009 ± 0.18     2.192 ± 0.12     1.029 ± 0.01       1.056 ± 0.01
pen20-F           2.723 ± 0.14     2.150 ± 0.12     1.031 ± 0.01       1.056 ± 0.01
penLoo            2.695 ± 0.11     2.053 ± 0.12     1.026 ± 0.01       1.058 ± 0.01

pen2-F+           3.015 ± 0.17     2.728 ± 0.12     1.084 ± 0.00       1.084 ± 0.01
pen5-F+           2.409 ± 0.13     2.080 ± 0.09     1.080 ± 0.00       1.063 ± 0.01
pen10-F+          2.305 ± 0.11     1.869 ± 0.09     1.082 ± 0.00       1.050 ± 0.01
pen20-F+          2.180 ± 0.10     1.832 ± 0.09     1.079 ± 0.00       1.052 ± 0.01
penLoo+           2.152 ± 0.10     1.858 ± 0.10     1.082 ± 0.00       1.048 ± 0.01

C'_or / C_or      0.795            0.996            0.998              0.977
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Table 4 

Accuracy indexes $C_{path\text{-}or}$ for experiments S1, S2, HSd1 and HSd2 (N = 1000). Uncertainties reported are empirical standard deviations divided by $\sqrt{N}$.

Experiment        S1               S2               HSd1               HSd2
s                 sin(π·)          sin(π·)          HeaviSine          HeaviSine
σ(x)              1                x                1                  x
n (sample size)   200              200              2048               2048
M_n               regular          2 bin sizes      dyadic, regular    dyadic, 2 bin sizes

Mal               2.064 ± 0.04     4.129 ± 0.10     1.015 ± 0.002      1.316 ± 0.010
Mal+              1.921 ± 0.03     3.500 ± 0.09     1.002 ± 0.001      1.354 ± 0.008
Mal*              2.168 ± 0.04     2.907 ± 0.07     1.045 ± 0.003      1.453 ± 0.006
Mal*+             1.941 ± 0.03     2.645 ± 0.06     1.004 ± 0.001      1.487 ± 0.005
E[pen_id]         2.053 ± 0.04     2.458 ± 0.06     1.029 ± 0.003      1.050 ± 0.002
E[pen_id]+        1.903 ± 0.03     2.142 ± 0.04     1.003 ± 0.001      1.038 ± 0.002

2-FCV             2.230 ± 0.05     2.755 ± 0.06     1.002 ± 0.001      1.134 ± 0.004
5-FCV             2.290 ± 0.05     2.827 ± 0.08     1.014 ± 0.002      1.064 ± 0.003
10-FCV            2.237 ± 0.05     2.832 ± 0.08     1.021 ± 0.002      1.057 ± 0.002
20-FCV            2.225 ± 0.05     2.794 ± 0.07     1.029 ± 0.003      1.054 ± 0.002
LOO               2.212 ± 0.05     2.832 ± 0.08     1.034 ± 0.003      1.053 ± 0.002

pen2-F            2.770 ± 0.07     3.340 ± 0.08     1.039 ± 0.003      1.052 ± 0.003
pen5-F            2.383 ± 0.06     2.982 ± 0.08     1.038 ± 0.003      1.053 ± 0.002
pen10-F           2.256 ± 0.05     2.867 ± 0.07     1.035 ± 0.003      1.053 ± 0.002
pen20-F           2.219 ± 0.05     2.869 ± 0.08     1.035 ± 0.003      1.053 ± 0.002
penLoo            2.215 ± 0.05     2.832 ± 0.08     1.034 ± 0.003      1.053 ± 0.002

pen2-F+           2.328 ± 0.05     2.979 ± 0.07     1.011 ± 0.002      1.056 ± 0.003
pen5-F+           2.050 ± 0.04     2.540 ± 0.06     1.006 ± 0.001      1.052 ± 0.002
pen10-F+          1.997 ± 0.03     2.436 ± 0.05     1.005 ± 0.001      1.048 ± 0.002
pen20-F+          2.018 ± 0.04     2.416 ± 0.06     1.004 ± 0.001      1.047 ± 0.002
penLoo+           1.959 ± 0.03     2.397 ± 0.06     1.004 ± 0.001      1.045 ± 0.002
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Table 5 

Accuracy indexes $C_{path\text{-}or}$ for experiments S1000, S√0.1, S0.1 and Svar2 (N = 250). Uncertainties reported are empirical standard deviations divided by $\sqrt{N}$.

Experiment        S1000            S√0.1            S0.1             Svar2
s                 sin(π·)          sin(π·)          sin(π·)          sin(π·)
σ(x)              1                √0.1             0.1              1_{x>1/2}
n (sample size)   1000             200              200              200
M_n               regular          regular          regular          2 bin sizes

Mal               1.704 ± 0.04     1.654 ± 0.03     1.407 ± 0.02     7.212 ± 0.40
Mal+              1.670 ± 0.03     1.636 ± 0.03     1.436 ± 0.02     5.740 ± 0.34
Mal*              1.793 ± 0.04     2.018 ± 0.04     3.273 ± 0.06     5.597 ± 0.33
Mal*+             1.664 ± 0.03     2.175 ± 0.05     3.719 ± 0.08     4.284 ± 0.25
E[pen_id]         1.793 ± 0.04     1.611 ± 0.03     1.378 ± 0.01     2.785 ± 0.19
E[pen_id]+        1.194 ± 0.02     1.177 ± 0.02     1.128 ± 0.01     1.337 ± 0.07

2-FCV             1.721 ± 0.04     1.723 ± 0.04     1.400 ± 0.02     3.507 ± 0.19
5-FCV             1.801 ± 0.06     1.740 ± 0.04     1.399 ± 0.02     3.486 ± 0.24
10-FCV            1.802 ± 0.05     1.735 ± 0.04     1.388 ± 0.02     3.149 ± 0.20
20-FCV            1.832 ± 0.05     1.687 ± 0.03     1.388 ± 0.02     3.257 ± 0.23
LOO               1.815 ± 0.05     1.685 ± 0.04     1.385 ± 0.01     3.127 ± 0.24

pen2-F            2.108 ± 0.07     1.864 ± 0.05     1.394 ± 0.02     3.839 ± 0.27
pen5-F            1.852 ± 0.05     1.675 ± 0.04     1.404 ± 0.02     3.237 ± 0.23
pen10-F           1.812 ± 0.05     1.767 ± 0.04     1.381 ± 0.01     3.093 ± 0.23
pen20-F           1.839 ± 0.05     1.706 ± 0.03     1.391 ± 0.01     3.123 ± 0.23
penLoo            1.825 ± 0.05     1.687 ± 0.04     1.385 ± 0.01     3.152 ± 0.24

pen2-F+           1.852 ± 0.05     1.765 ± 0.05     1.420 ± 0.02     3.336 ± 0.23
pen5-F+           1.732 ± 0.04     1.664 ± 0.03     1.408 ± 0.02     2.890 ± 0.22
pen10-F+          1.663 ± 0.04     1.657 ± 0.03     1.394 ± 0.02     2.810 ± 0.21
pen20-F+          1.680 ± 0.04     1.623 ± 0.03     1.397 ± 0.01     2.657 ± 0.19
penLoo+           1.673 ± 0.03     1.624 ± 0.03     1.409 ± 0.02     2.659 ± 0.18
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Table 6 

Accuracy indexes $C_{path\text{-}or}$ for experiments Sqrt, His6, DopReg and Dop2bin (N = 250). Uncertainties reported are empirical standard deviations divided by $\sqrt{N}$.

Experiment        Sqrt             His6             DopReg             Dop2bin
s                 √·               His6             Doppler            Doppler
σ(x)              1                1                1                  1
n (sample size)   200              200              2048               2048
M_n               regular          regular          dyadic, regular    dyadic, 2 bin sizes

Mal               2.557 ± 0.12     2.356 ± 0.18     1.040 ± 0.00       1.049 ± 0.00
Mal+              2.232 ± 0.10     2.041 ± 0.12     1.094 ± 0.00       1.045 ± 0.01
Mal*              2.838 ± 0.15     2.533 ± 0.21     1.013 ± 0.00       1.057 ± 0.00
Mal*+             2.349 ± 0.11     2.168 ± 0.16     1.073 ± 0.00       1.038 ± 0.00
E[pen_id]         2.678 ± 0.14     2.182 ± 0.17     1.026 ± 0.00       1.053 ± 0.00
E[pen_id]+        1.348 ± 0.07     1.230 ± 0.06     1.050 ± 0.00       1.038 ± 0.00

2-FCV             2.974 ± 0.17     3.713 ± 0.25     1.100 ± 0.00       1.164 ± 0.01
5-FCV             3.209 ± 0.21     2.977 ± 0.24     1.066 ± 0.00       1.046 ± 0.00
10-FCV            2.912 ± 0.16     2.639 ± 0.21     1.045 ± 0.00       1.047 ± 0.00
20-FCV            2.889 ± 0.15     2.584 ± 0.20     1.035 ± 0.00       1.050 ± 0.00
LOO               3.061 ± 0.17     2.568 ± 0.21     1.027 ± 0.00       1.055 ± 0.00

pen2-F            5.062 ± 0.37     4.462 ± 0.30     1.050 ± 0.00       1.059 ± 0.01
pen5-F            3.595 ± 0.25     3.458 ± 0.28     1.034 ± 0.00       1.052 ± 0.00
pen10-F           3.445 ± 0.22     2.744 ± 0.21     1.031 ± 0.00       1.053 ± 0.00
pen20-F           3.120 ± 0.17     2.670 ± 0.21     1.032 ± 0.00       1.053 ± 0.00
penLoo            3.063 ± 0.17     2.571 ± 0.21     1.027 ± 0.00       1.055 ± 0.00

pen2-F+           3.723 ± 0.29     3.777 ± 0.26     1.087 ± 0.00       1.082 ± 0.01
pen5-F+           2.790 ± 0.18     2.698 ± 0.19     1.083 ± 0.00       1.061 ± 0.01
pen10-F+          2.653 ± 0.14     2.364 ± 0.20     1.085 ± 0.00       1.047 ± 0.01
pen20-F+          2.497 ± 0.13     2.318 ± 0.20     1.082 ± 0.00       1.049 ± 0.01
penLoo+           2.437 ± 0.12     2.218 ± 0.18     1.085 ± 0.00       1.045 ± 0.00
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3. Additional proofs. 

3.1. Proof of Lemma 6. In this proof, we denote by L any constant that may depend on a, b, $(c_i)_{1 \leq i \leq 4}$, $(\alpha_j)_{1 \leq j \leq 4}$, $c_{rich}$ and C, possibly different from one place to another.
First of all, there is a model $m_1 \in \mathcal{M}_n$ such that

ln(n)''i < (^2anb~^y^^^ < Dm^ < {2anh'^y^^ + c^ch < cin(ln(n))~^ 
(at least for n> L). As a consequence, (27) implies that 

(1) criti(mi)<a^/362/3n-2/3|^3x2-2/3 + c,ieh(^y^'^ ( 1 + C2 ln(n)-"2 ) . 
With a similar argument, for n > L, there exists a model m2 £ -Mn such that 

(2) crit2(m2)<ai/=^(6C)'/3n-2/3 X 2-2/3 + c,i,h(^y^'^ ( 1 + C2 ln(n)-'^2 ) . 

We will now derive from ([2]) some tight bounds on D^. First, the upper bound in ^ is smaller 
than the lower bounds in both (29) and (30) for n > L. This proves that 

Hnr'<D-< "^"^ 



ln(n) 

1/3 



Then, according to (49), we have for every m G A4„ of dimension D^a = ( j [1 + 5] 
(which is between ln(n)'^i and j^^^ for n > L, as long SkS 1 < 8 > —1): 

crit2(m) > (6C7)2/3^-2/3 Z' 2-2/3 + j)-2 ^ 2^3(1 + 5)) {l - C2Hny^) 



1-C2ln(n)--^ /(<5) 

> Crit2(?7l2) X — — X 



1/3 



l + C2ln(n)-«2 3x2-2/3+CHchf^^ 
with / defined by f{5) = 2^2/3(1 _^ ^y2 _^ 2V3(i + Using Lemma [2 below, we then have 
crit2(m) ^ 1 - C2ln(n)-'"2 3 x 2^2/3 + 3 x 2-^'^'^ ((52 A 1) 



crit2(m2) 1 + C2ln(n) «2 3 ^ 2-2/3 + c^ch f — V^^ 



This lower bound is strictly larger than 1 as soon as 5"^ > ln(n) '^2/2 and n > L, so that 
(|^)'"(l-M»)-'')<«S<(|?)'"(l + Mn)-") . 
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We can now use (27) in order to bound criti(m). For n> L, using again Lemma [U 

criti(m) >aVV/3n-2/3 (^^^'^'^ ( 1 - L ln(n)~''^/^) 

> aV362/3^-2/3 ^ 2"2/3 + ^c'/^ - l)') (l - Lln(n)-'^^/4) 

> criti(mi) (^1 + 2^/3 X S^^ (c^^/^ - l)^ - ln(n)-'^2/5^ ^ 

which proves (31). □ 
Remark 1. A similar argument proves that for n > L, 

criti(m) < criti(mi) (^1 + 2^/3 x 3"^ (c^^/^ _ + Lln(n)-''2/4^ ^ 

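Moreover, if $\mathrm{crit}_1$ satisfies (ii) and (iii), we prove in a similar way that if $n \geq n_0$, for every $m \in \arg\min_{m' \in \mathcal{M}_n} \mathrm{crit}_2(m')$,

(4) $\mathrm{crit}_1(m) \leq \left(1 + K(C) + \ln(n)^{-\alpha_2/5}\right) \inf_{m' \in \mathcal{M}_n} \left\{\mathrm{crit}_1(m')\right\}$.

This justifies our first comment behind Thm. 1.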

Lemma 1. Let $f : (-1, +\infty) \to \mathbb{R}$ be defined by $f(x) = 2^{-2/3}(1+x)^{-2} + 2^{1/3}(1+x)$. Then, for every $x > -1$,

$f(x) \geq 3 \times 2^{-2/3} + 3 \times 2^{-14/3}\,(x^2 \wedge 1)$.

Proof of Lemma 1. We apply the Taylor-Lagrange theorem to f (which is infinitely differentiable) at order two, between 0 and x. The result follows since $f(0) = 3 \times 2^{-2/3}$, $f'(0) = 0$ and $f''(t) = 6 \times 2^{-2/3} \times (1+t)^{-4} \geq 3 \times 2^{1/3 - 4}$ if $t \leq 1$. If $x > 1$, the result follows from the fact that $f' \geq 0$ on $[0, +\infty)$. □
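
As a sanity check of the constants, the inequality of Lemma 1 can be evaluated on a grid. The sketch below takes $f(x) = 2^{-2/3}(1+x)^{-2} + 2^{1/3}(1+x)$ and the lower bound $3 \times 2^{-2/3} + 3 \times 2^{-14/3}(x^2 \wedge 1)$; the grid and the variable names are arbitrary choices, and a numerical check on finitely many points is of course no substitute for the proof.

```python
def f(x):
    """The function of Lemma 1, defined for x > -1."""
    return 2 ** (-2 / 3) * (1 + x) ** (-2) + 2 ** (1 / 3) * (1 + x)

def lower_bound(x):
    """Right-hand side of Lemma 1: 3 * 2^(-2/3) + 3 * 2^(-14/3) * min(x^2, 1)."""
    return 3 * 2 ** (-2 / 3) + 3 * 2 ** (-14 / 3) * min(x * x, 1.0)

# Grid from -0.999 to about 11 with step 0.001 (arbitrary choice).
grid = [-0.999 + 0.001 * k for k in range(12000)]
worst_gap = min(f(x) - lower_bound(x) for x in grid)
```

The gap `f(x) - lower_bound(x)` vanishes at x = 0, where both sides equal $3 \times 2^{-2/3}$, so `worst_gap` should be zero up to rounding error and never significantly negative.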

3.2. End of the proof of Prop. 2. We here compute $R_{1,W}(n, \hat p_\lambda)$ and $R_{2,W}(n, \hat p_\lambda)$ when V does not divide $n \hat p_\lambda$, a case that we skipped in Appendix B.4.2.



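Since $(W_i)_{X_i \in I_\lambda}$ is exchangeable and $W_i$ takes only two values,

$W_\lambda = \mathbb{E}_W\left[W_i \mid W_\lambda\right] = \dfrac{V}{V-1}\, \mathbb{P}\left(W_i = \dfrac{V}{V-1} \,\middle|\, W_\lambda\right)$.

Thus,

$\mathcal{L}(W_i \mid W_\lambda) = \dfrac{V}{V-1}\, \mathcal{B}\left(\dfrac{V-1}{V}\, W_\lambda\right)$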



14 

so that 



R2,w{n',P\) 
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and Ri^w{^,Px) 



V 



y _ 1 ^ '--^ y 

There exists a, 6 G N such that < b < V — 1 and npx = aV + h. Then, 



so that 



E 



y(a(y- 1) +6) 
(F-l)(ay + 6) 



V -b 
V 



and 



y(a(y-l) + 6-l) 
{V -l){aV + b) 



b 
V 



w: 



V -b{V -l){aV + b) ^b {V-l){aV + 



1 



V V{a{V -l) + b) VV{a{V-l) + b-l) 
b {V-l){aV + b)b 



We deduce 



V{aiV-l) + b) V^{a{V -l) + b-l){a{V -l) + b) 
1 b ^ (aF + 6)6 



The result follows with 



j(penV) 



V -I 



npx 



npx — a \ V npx — a — 1 





r 2 1 




0;^^ — ^ 




. npx-2_ 



□ 
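
The case where V does not divide $n\hat p_\lambda$ can be illustrated with exact arithmetic. The sketch below rests on assumptions that are not spelled out in this section: the $n\hat p_\lambda = aV + b$ points of $I_\lambda$ are spread as evenly as possible over the V blocks (b blocks with a + 1 points, V − b blocks with a points), the removed block is uniform over the V blocks, and the remaining points get weight V/(V − 1). Under these assumptions, $W_\lambda$ takes the two values appearing in the proof, with probabilities (V − b)/V and b/V. The function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction

def w_lambda_distribution(count, V):
    """Exact law of W_lambda for V-fold weights (assumed even block sizes).

    count : number of data points falling in I_lambda (n * p_hat_lambda)
    V     : number of blocks
    Returns (a, b, {value: probability}) with count = a * V + b.
    """
    a, b = divmod(count, V)
    sizes = [a + 1] * b + [a] * (V - b)   # points of I_lambda in each block
    scale = Fraction(V, V - 1)            # weight of each kept point
    dist = Counter()
    for removed in sizes:                 # each block is removed w.p. 1/V
        dist[scale * Fraction(count - removed, count)] += Fraction(1, V)
    return a, b, dict(dist)

a, b, dist = w_lambda_distribution(count=13, V=5)
mean_w = sum(value * prob for value, prob in dist.items())
```

With 13 points and V = 5 (so a = 2, b = 3), the two values are 55/52 and 25/26, and the mean of $W_\lambda$ is exactly 1.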



3.3. Proof of Lemma 8. Although this lemma can be found in [Arl07] (where it is called Lemma 5.7), we recall its proof here for the sake of completeness.



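First, split the penalty (without the constant C) into these two terms:

(5) $\hat p_1(m) = \sum_{\lambda \in \Lambda_m} \mathbb{E}_W\left[\hat p_\lambda \left(\hat\beta^W_\lambda - \hat\beta_\lambda\right)^2 \,\middle|\, W_\lambda > 0\right]$

(6) $\hat p_2(m) = \sum_{\lambda \in \Lambda_m} \mathbb{E}_W\left[\hat p^W_\lambda \left(\hat\beta^W_\lambda - \hat\beta_\lambda\right)^2\right]$

This split into two terms is the equivalent of the split of $\mathrm{pen}_{\mathrm{id}}$ into $p_1$ and $p_2$ (plus a centered term).

We first compute the following quantity, which appears in both $\hat p_1$ and $\hat p_2$: let $\lambda \in \Lambda_m$ and $W_\lambda > 0$,

(7) $\hat p_\lambda\, \mathbb{E}_W\left[\left(\hat\beta^W_\lambda - \hat\beta_\lambda\right)^2 \,\middle|\, W_\lambda\right] = \dfrac{1}{n^2 \hat p_\lambda} \sum_{X_i \in I_\lambda} (Y_i - \beta_\lambda)^2\, \mathbb{E}_W\left[\dfrac{(W_i - W_\lambda)^2}{W_\lambda^2} \,\middle|\, W_\lambda\right] + \dfrac{1}{n^2 \hat p_\lambda} \sum_{i \neq j,\ X_i, X_j \in I_\lambda} (Y_i - \beta_\lambda)(Y_j - \beta_\lambda)\, \mathbb{E}_W\left[\dfrac{(W_i - W_\lambda)(W_j - W_\lambda)}{W_\lambda^2} \,\middle|\, W_\lambda\right]$.

Since the weights are exchangeable, $(W_i)_{X_i \in I_\lambda}$ is also exchangeable conditionally to $W_\lambda$ and $(X_i)_{1 \leq i \leq n}$. Thus, the "variance" term

$R_V(n, n\hat p_\lambda, W_\lambda, \mathcal{L}(W)) := \mathbb{E}_W\left[(W_i - W_\lambda)^2 \,\middle|\, W_\lambda\right]$

does not depend on i (provided that $X_i \in I_\lambda$), and the "covariance" term

$R_C(n, n\hat p_\lambda, W_\lambda, \mathcal{L}(W)) := \mathbb{E}_W\left[(W_i - W_\lambda)(W_j - W_\lambda) \,\middle|\, W_\lambda\right]$

does not depend on (i, j) (provided that $i \neq j$ and $X_i, X_j \in I_\lambda$). Moreover,

$0 = \mathbb{E}_W\left[\left(\sum_{X_i \in I_\lambda} (W_i - W_\lambda)\right)^2 \,\middle|\, W_\lambda\right] = n\hat p_\lambda\, R_V(n, n\hat p_\lambda, W_\lambda, \mathcal{L}(W)) + n\hat p_\lambda (n\hat p_\lambda - 1)\, R_C(n, n\hat p_\lambda, W_\lambda, \mathcal{L}(W))$

so that, if $n\hat p_\lambda \geq 2$,

(8) $R_C(n, n\hat p_\lambda, W_\lambda, \mathcal{L}(W)) = \dfrac{-1}{n\hat p_\lambda - 1}\, R_V(n, n\hat p_\lambda, W_\lambda, \mathcal{L}(W))$ and $R_V(n, 1, W_\lambda, \mathcal{L}(W)) = 0$.

Combining (7) and (8), we obtain

(9) $\mathbb{E}_W\left[\hat p_\lambda \left(\hat\beta^W_\lambda - \hat\beta_\lambda\right)^2 \,\middle|\, W_\lambda\right] = \dfrac{R_V(n, n\hat p_\lambda, W_\lambda, \mathcal{L}(W))}{W_\lambda^2\, n^2 \hat p_\lambda} \left[\dfrac{n\hat p_\lambda}{n\hat p_\lambda - 1}\, S_{\lambda,2} - \dfrac{1}{n\hat p_\lambda - 1}\, S_{\lambda,1}^2\right] \mathbf{1}_{n\hat p_\lambda \geq 2}$.

Combining (9) and (5) (resp. (9) and (6)), we have the following expressions for $\hat p_1$ and $\hat p_2$:

(10) $\hat p_1(m) = \sum_{\lambda \in \Lambda_m} \dfrac{R_{1,W}(n, \hat p_\lambda)\, \mathbf{1}_{n\hat p_\lambda \geq 2}}{n^2 \hat p_\lambda} \left[\dfrac{n\hat p_\lambda}{n\hat p_\lambda - 1}\, S_{\lambda,2} - \dfrac{1}{n\hat p_\lambda - 1}\, S_{\lambda,1}^2\right]$

(11) $\hat p_2(m) = \sum_{\lambda \in \Lambda_m} \dfrac{R_{2,W}(n, \hat p_\lambda)\, \mathbf{1}_{n\hat p_\lambda \geq 2}}{n^2 \hat p_\lambda} \left[\dfrac{n\hat p_\lambda}{n\hat p_\lambda - 1}\, S_{\lambda,2} - \dfrac{1}{n\hat p_\lambda - 1}\, S_{\lambda,1}^2\right]$

Remark that the terms of the sum for which $n\hat p_\lambda = 1$ are all equal to zero, which can be ensured with the convention $0 \times \infty = 0$ since $R_{1,W}(n, n^{-1}) = R_{2,W}(n, n^{-1}) = 0$. The result follows. □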
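
The relation between the "variance" and "covariance" terms only uses exchangeability and the fact that the deviations from the mean sum to zero. It can be checked exactly on a toy example; in the sketch below, the weight vector is made exchangeable by averaging over all its permutations (so its mean is fixed and the conditioning is trivial). The weight values and the function name are arbitrary illustrations, not quantities from the paper.

```python
from fractions import Fraction
from itertools import permutations

def variance_and_covariance_terms(weights):
    """Exact R_V and R_C for a uniformly permuted weight vector.

    R_V = E[(W_1 - mean)^2], R_C = E[(W_1 - mean)(W_2 - mean)],
    where the expectation is over all orderings of `weights`.
    """
    k = len(weights)
    mean = Fraction(sum(weights), k)
    perms = list(permutations(weights))
    r_v = sum((p[0] - mean) ** 2 for p in perms) / len(perms)
    r_c = sum((p[0] - mean) * (p[1] - mean) for p in perms) / len(perms)
    return r_v, r_c

# Four points in the cell, weights with mean 1 (arbitrary example).
weights = [Fraction(0), Fraction(2), Fraction(1), Fraction(1)]
r_v, r_c = variance_and_covariance_terms(weights)
```

Here `r_c` equals `-r_v / (k - 1)` with k = 4 points, which is the identity used to eliminate the covariance term.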



3.4. Concentration of $p_1$: detailed proof. Within the proof of Prop. 9, we used Lemma 4 in order to control the deviations of $\mathbb{E}^{\Lambda_m}[p_1(m)]$ around its expectation. Implicitly, we used the following lemma (which is indeed a straightforward consequence of Lemma 4).



Lemma 2. We assume that $\min_{\lambda \in \Lambda_m} \{n p_\lambda\} \geq B_n > 1$.

1. Lower deviations: let $c_1 = 0.184$. For all $x \geq 0$, with probability at least $1 - e^{-x}$,

(12) $\mathbb{E}^{\Lambda_m}[p_1(m)] \geq \mathbb{E}[p_1(m)] - \theta^{-}(x, B_n, D_m, A, \sigma_{\min}) \times \mathbb{E}[p_2(m)]$

with 6' 



L 



'^l{ciBn) + -j2- 



2. Upper deviations: let $c_2 = 0.28$ and $c_4 = 0.09$. For every $x \geq 0$, with probability at least $1 - e^{-x}$,

(13) $\mathbb{E}^{\Lambda_m}[p_1(m)] \leq \mathbb{E}[p_1(m)] + \theta^{+}(x, B_n, D_m, A, \sigma_{\min}) \times \mathbb{E}[p_2(m)]$

with
A^ 



3+ . 



L 



(C2-B„) + 



Proof. From (19) and (37), we have an explicit expression for pi. We then apply Lemma 4, 
with Xx = npx and ax = px {(^xf' > 0. For we used the general upper bound 



rnax {cjxf 51 - ^ 



□ 



Remark 2. li Bn> yc^^ ^ c^^j ln(n), for every 7 > 0, 

r V0+(7ln(n),S„,D^,yl,cTn,in) < L^ylV-2^Z)-i/2 in(n) 

since Dm < n. 
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